CN108415950A - Hypernym aggregation method and device - Google Patents
Hypernym aggregation method and device
- Publication number
- CN108415950A CN108415950A CN201810100677.0A CN201810100677A CN108415950A CN 108415950 A CN108415950 A CN 108415950A CN 201810100677 A CN201810100677 A CN 201810100677A CN 108415950 A CN108415950 A CN 108415950A
- Authority
- CN
- China
- Prior art keywords
- hypernym
- pending
- entity type
- word
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to information processing technology, and in particular to a hypernym aggregation method and device, intended to improve the accuracy of hypernym aggregation. The method is as follows: a terminal device calculates the word-vector similarity between hypernyms from the character vectors that each hypernym contains, calculates the entity-type similarity between hypernyms from the entity types associated with the entities corresponding to each hypernym, and aggregates those hypernyms whose word-vector similarity reaches a first preset threshold and whose entity-type similarity reaches a second preset threshold. In this way, short texts such as similar hypernyms can be handled effectively: the key textual information contained in a hypernym is mined effectively, and the type features of the hypernym are depicted accurately; at the same time, the heavy workload of manually designing features is avoided, the generalization ability of the model is enhanced, invalid hypernyms are identified efficiently, redundant data among the hypernyms is removed, and the accuracy of hypernym aggregation is significantly improved.
Description
Technical field
The present invention relates to information processing technology, and in particular to a hypernym aggregation method and device.
Background
In a hypernym network generated from a knowledge graph, hypernyms with identical semantics usually need to be aggregated in order to avoid redundancy; that is, for one and the same meaning, the hypernyms that express it in different ways are extracted and merged. For example, the hypernyms for a single-lens reflex camera include "SLR camera", "single-lens reflex", "SLR", and so on; hypernyms that share the same meaning but differ in surface description are called identically semantic hypernyms. The process of merging such identically semantic hypernyms and representing them with one common name is called hypernym aggregation. Merging hypernyms with identical semantics reduces the redundancy of the hypernym network and improves its quality.
In the prior art, hypernym aggregation is usually implemented in one of two ways.

The first way is to cluster semantically similar texts. Common methods represent each text with features such as word vectors, bag-of-words, or topic models, and then apply a standard clustering algorithm, such as k-means, hierarchical clustering, or spectral clustering, to obtain sets of similar texts. The first way can group semantically similar texts that contain relatively many words; that is, it only meets the aggregation task for semantically similar long texts, and its aggregation accuracy is relatively low. Aggregating hypernyms with identical semantics, however, is a high-precision semantic clustering task, which the first way can therefore hardly solve well.
The second way is to merge very similar short texts mainly from the angle of string similarity, using methods such as edit distance. The second way can solve the aggregation of identically semantic hypernyms; however, it only captures the string information of the hypernyms and judges whether two hypernyms describe the same thing by computing the similarity between their character strings. In practice, the same thing often has different descriptions; for example, "children" and "child" express the same meaning, but their strings (in the original language) are entirely different. Therefore, merging semantically similar hypernyms based on edit distance also has certain limitations.

In view of this, a new hypernym aggregation method needs to be designed to overcome the above drawbacks.
Summary of the invention
Embodiments of the present invention provide a hypernym aggregation method and device, so as to improve the accuracy of hypernym aggregation.

The specific technical solutions provided by the embodiments of the present invention are as follows:
A hypernym aggregation method, including:

obtaining multiple to-be-processed hypernyms, and determining the character vector of each character contained in each to-be-processed hypernym;

calculating, based on the obtained character vectors and according to a specified algorithm, the word vector of each to-be-processed hypernym;

determining the entity types associated with each to-be-processed hypernym in a knowledge graph;

calculating, based on the word vector and the associated entity types of each to-be-processed hypernym, the word-vector similarity and the entity-type similarity between every two to-be-processed hypernyms;

aggregating the corresponding to-be-processed hypernyms when the word-vector similarity reaches a first preset threshold and the entity-type similarity reaches a second preset threshold.
A hypernym aggregation device, including:

a first determination unit, configured to obtain multiple to-be-processed hypernyms, determine the character vector of each character contained in each to-be-processed hypernym, and calculate, based on the obtained character vectors and according to a specified algorithm, the word vector of each to-be-processed hypernym;

a second determination unit, configured to determine the entity types associated with each to-be-processed hypernym in a knowledge graph;

a computing unit, configured to calculate, based on the word vector and the associated entity types of each to-be-processed hypernym, the word-vector similarity and the entity-type similarity between every two to-be-processed hypernyms;

an aggregation unit, configured to aggregate the corresponding to-be-processed hypernyms when the word-vector similarity reaches a first preset threshold and the entity-type similarity reaches a second preset threshold.
A storage medium storing a program for implementing the hypernym aggregation method; when the program is run by a processor, the following steps are executed:

obtaining multiple to-be-processed hypernyms, and determining the character vector of each character contained in each to-be-processed hypernym;

calculating, based on the obtained character vectors and according to a specified algorithm, the word vector of each to-be-processed hypernym;

determining the entity types associated with each to-be-processed hypernym in a knowledge graph;

calculating, based on the word vector and the associated entity types of each to-be-processed hypernym, the word-vector similarity and the entity-type similarity between every two to-be-processed hypernyms;

aggregating the corresponding to-be-processed hypernyms when the word-vector similarity reaches a first preset threshold and the entity-type similarity reaches a second preset threshold.
A computer device, including one or more processors and one or more computer-readable media on which instructions are stored; when the instructions are executed by the one or more processors, the device performs any one of the methods described above.
In the embodiments of the present invention, the terminal device calculates the word-vector similarity between hypernyms from the character vectors that each hypernym contains, calculates the entity-type similarity between hypernyms from the entity types associated with the entities corresponding to each hypernym, and aggregates those hypernyms whose word-vector similarity reaches the first preset threshold and whose entity-type similarity reaches the second preset threshold. Since a hypernym is typically composed of only a few words, conventional word segmentation would introduce large errors and information loss; therefore, in the embodiments of the present invention, the similarity judgment between hypernyms is made using word vectors characterized by the character vectors the hypernyms contain, together with the entity types associated with the hypernyms. This effectively handles short texts such as similar hypernyms: it both mines the key textual information contained in a hypernym effectively and depicts the type features of the hypernym accurately; at the same time, it avoids the heavy workload of manually designing features, enhances the generalization ability of the model, efficiently identifies invalid hypernyms, removes redundant data among the hypernyms, and significantly improves the accuracy of hypernym aggregation.
Description of the drawings
Fig. 1 is a schematic diagram of a knowledge-graph example in the prior art;

Fig. 2 is a schematic diagram of an entity-type example in the prior art;

Fig. 3 is a schematic diagram of the flow of hypernym aggregation based on a knowledge graph in an embodiment of the present invention;

Fig. 4A is a schematic diagram of the associations between to-be-processed hypernyms and entities in an embodiment of the present invention;

Fig. 4B is a schematic diagram of the associations between entities and entity types in an embodiment of the present invention;

Fig. 5 is a schematic diagram of the functional structure of a terminal device in an embodiment of the present invention;

Fig. 6 is a schematic diagram of the functional structure of a computer device in an embodiment of the present invention.
Detailed description of embodiments
To improve the accuracy of hypernym aggregation, in the embodiments of the present invention the word vector of each to-be-processed hypernym is determined from the character vectors it contains, and this is combined with the entity types corresponding to each to-be-processed hypernym in the knowledge graph to judge the semantic similarity between the to-be-processed hypernyms, so that the hypernyms with identical semantics can be singled out and aggregated. This considers not only the semantic information of the hypernym itself but also the entity-type information associated with it, and can therefore meet high-precision semantic fusion requirements.
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

For ease of presentation, some terms are defined first.
Knowledge graph (Knowledge Graph/Vault): also known in library and information science as knowledge-domain visualization or a knowledge-domain mapping map; a family of different graphs that display the development process of knowledge and the structural relations within it, using visualization techniques to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the connections among its parts.
As shown in Fig. 1, in a knowledge graph a node is called an entity; the entity is the object that the knowledge graph describes. For example, suppose a node is "Liu X": it represents one entity, and the attributes contained in its attribute set include occupation, date of birth, hobbies, and so on.
Hypernym: a term that is conceptually broader in extension.

For example, "carnivore" is a hypernym of "tiger", and "felid" can also be a hypernym of "tiger"; a hypernym can therefore be understood as a cluster category of entities obtained according to attribute features.

For instance, clustering "tiger" by the attribute "carnivorous" yields the hypernym "carnivorous animal"; clustering "tiger" by the attribute "family Felidae" yields the hypernym "felid".
Entity type: every entity in the knowledge graph corresponds to an entity type, which can be regarded as a general categorization of the entity. One entity type may contain multiple entities. For example, the entity type of the entity "rose" is "plant"; the entity type of the film "Wolf Warrior 2" is "film".

For example, as shown in Fig. 2, the entities "tiger", "tortoise", and "butterfly" share one and the same entity type, "animal".
Word vector (词向量): a distributed representation of a word; the basic idea is to map each word to a vector of fixed dimension (much smaller than the vocabulary size). These vectors form a semantic space in which semantically similar words usually lie close together.

Character vector (字向量): a distributed representation at the character level; each character is mapped into the semantic space to obtain its semantic vector, and the character vectors of semantically similar characters usually lie close together in that space.
Dense Interpolated Embedding (DIE) is a method of synthesizing a word vector from character vectors; experience has shown that it represents similarly described strings effectively.
In the embodiments of the present invention, at the pre-processing stage, the terminal device can train character vectors with the word2vec tool on an encyclopedia corpus; the plain-text corpus used for training the character vectors is kept consistent with the source corpus of the hypernyms. This ensures that the character vectors of the characters contained in a hypernym accurately characterize the hypernym's text-level features, which lays a good foundation for subsequently generating word vectors.

Specifically, the plain text is first segmented into characters: a run of consecutive English letters is one token, a number is one token, and each Chinese character is one token. Then word2vec is trained on the character-segmented plain-text corpus to obtain the character vectors used by the DIE algorithm. Since DIE is an algorithm that concatenates character vectors, the character-vector dimension is generally set small; optionally, in an embodiment of the present invention, the dimension of a character vector is set to 25, i.e., a character vector has features in 25 dimensions.
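The character segmentation rule above (a run of English letters is one token, a number is one token, every other character is its own token) can be sketched in a few lines of Python. The function name `char_tokenize` is an illustrative assumption, and since the text does not say whether a multi-digit number is one token or several, digit runs are grouped here:

```python
import re

def char_tokenize(text: str) -> list:
    """Split text into 'characters' for character-vector training:
    a run of ASCII letters is one token, a run of digits is one token,
    and every other non-space character (e.g. a Chinese character)
    is a token of its own."""
    return re.findall(r"[A-Za-z]+|\d+|\S", text)

print(char_tokenize("SLR相机2018款"))  # → ['SLR', '相', '机', '2018', '款']
```

The resulting token stream would then be fed to word2vec (e.g., with a vector size of 25, as suggested above) to obtain the character vectors used by the DIE algorithm.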
As shown in Fig. 3, in an embodiment of the present invention, the detailed flow by which the terminal device aggregates hypernyms is as follows:

Step 300: The terminal device obtains multiple to-be-processed hypernyms, determines the character vector of each character contained in each to-be-processed hypernym, and calculates, based on the obtained character vectors and according to a specified algorithm, the word vector of each to-be-processed hypernym.
Optionally, the specified algorithm used by the terminal device can be the DIE algorithm.
Specifically, taking any one to-be-processed hypernym (hereinafter hypernym x) as an example, step 300 is executed as follows:

In the embodiment of the present invention, in order to capture both the string information of hypernym x and its textual semantic information, the word vector of hypernym x is optionally synthesized with the DIE algorithm.

The basic idea of the DIE algorithm is that the word vector of hypernym x is composed from the character vectors of hypernym x, with character vectors at different positions composing different parts of the word vector; this preserves character-order information. In addition, the character vectors are trained on large-scale unstructured text and carry a certain semantic similarity, so the word vector of hypernym x synthesized from them also has certain semantic features. The specific execution procedure is as follows:
First, the preset at least two subregions corresponding to hypernym x are determined, where each subregion corresponds to part of the dimensions of the word vector of hypernym x.

Second, based on the character vectors corresponding to the to-be-processed hypernym, the region feature of each subregion is calculated.

Specifically, the following operations can be performed for each subregion:

based on the preset number of subregions and the number of character vectors contained in hypernym x, determine the weight of each character vector of hypernym x in the subregion;

according to each character vector and its weight in the subregion, calculate the region feature of hypernym x in the subregion.

Finally, based on the region features of hypernym x thus obtained, the word vector of hypernym x is calculated.
For example, the DIE algorithm can be described by the following formulas:

v[m] = Σ_{i=0}^{I-1} f(i, m) · char_i,  m ∈ [0, M-1]

V = [v[0], …, v[m], …, v[M-1]]

where i is the index of a character vector, I is the number of character vectors, m is the index of a subregion, M is the number of subregions (so the dimension of the synthesized word vector is M times that of a character vector), v[m] is the region feature of the m-th subregion, V is the word vector of the hypernym, char_i is the character vector corresponding to the i-th character of the hypernym, and f(i, m) is the weight of the i-th character vector in the m-th subregion. In the embodiment of the present invention, a region feature refers to the text-level feature embodied by the part of the word-vector dimensions corresponding to a subregion.
For example, suppose hypernym x is "mammal" (哺乳动物, four characters), the word-vector dimension is 100, and four subregions are divided, covering dimensions [1, 25], [26, 50], [51, 75], and [76, 100]. Then:

v[0] = char("哺") × f(0, 0) + char("乳") × f(1, 0) + char("动") × f(2, 0) + char("物") × f(3, 0)

v[1] = char("哺") × f(0, 1) + char("乳") × f(1, 1) + char("动") × f(2, 1) + char("物") × f(3, 1)

v[2] = char("哺") × f(0, 2) + char("乳") × f(1, 2) + char("动") × f(2, 2) + char("物") × f(3, 2)

v[3] = char("哺") × f(0, 3) + char("乳") × f(1, 3) + char("动") × f(2, 3) + char("物") × f(3, 3)

V = [v[0], v[1], v[2], v[3]]

where f(i, m) is the weight of the i-th character vector in the m-th subregion; for example, f(0, 0) is the weight of the 0th character vector char("哺") in the 0th subregion [1, 25].
The terminal device divides the word vector of a to-be-processed hypernym into regions; each subregion corresponds to part of the dimensions of the hypernym's word vector, i.e., each subregion has its own region feature. A to-be-processed hypernym contains multiple character vectors, and different character vectors contribute differently to the region features of different subregions. Therefore, giving each character vector contained in the to-be-processed hypernym its own weight in each subregion lets the region feature of each subregion be embodied by the dimensions corresponding to the character vectors with larger weights. In this way, each region feature focuses on the text features of only part of the character vectors, which effectively improves the textual specificity and feature accuracy of each region feature and, in turn, the accuracy of the finally calculated word vector.
Step 310: The terminal device determines, for each to-be-processed hypernym, the entity types associated with it in the knowledge graph.
In the embodiment of the present invention, a to-be-processed hypernym can correspond to several entities in the knowledge graph, and these entities often correspond to at least one entity type; an entity type is a general categorization of entities and can embody features of an entity in a certain respect.
For example, referring to Fig. 4A and Fig. 4B, suppose the to-be-processed hypernym is "stars of the nineties", and it corresponds to several entities in the knowledge graph, e.g., "Liu X", "Zhang X", "Li X", "Guo X", etc. Here "Liu X" and "Li X" jointly correspond to the type "film and TV star", while "Liu X" and "Guo X" jointly correspond to "singer"; clearly, "Liu X" corresponds to two different entity types, and "Li X" and "Guo X" correspond to different entity types respectively.
For such cases, when step 310 is executed, taking any one to-be-processed hypernym (hereinafter hypernym x) as an example, the terminal device can determine all entities corresponding to hypernym x in the knowledge graph, determine the entity types associated with each of those entities, and screen out the N entity types associated with the most entities as the entity types associated with hypernym x, where N is a preset natural number, N ≥ 1.
For example, suppose hypernym x is "most popular male of XX", and in the knowledge graph the entities associated with hypernym x include "Sun Yang", "Wu X", "Liu X", "Yuan X", "Wang X", "Small Vest", and so on.

Among them, the entity type corresponding to "Sun Yang" and "Liu X" is "athlete", the entity type corresponding to "Wu X" is "star", the entity type corresponding to "Yuan X" is "scientist", and "Wang X" and "Small Vest" correspond to "internet celebrity".

Suppose that, among all entities corresponding to hypernym x, 20 entities correspond to "athlete", 50 to "star", 5 to "scientist", and 40 to "internet celebrity". Then, after screening with N = 3, it is finally determined that hypernym x corresponds to three entity types: "star", "internet celebrity", and "athlete".
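The top-N screening in step 310 amounts to counting, for each entity type, how many of the hypernym's entities carry it, and keeping the N most frequent types. A sketch, with illustrative function and variable names:

```python
from collections import Counter

def top_n_entity_types(entity_to_types, n=3):
    """entity_to_types: mapping from each entity associated with the
    hypernym in the knowledge graph to the list of its entity types.
    Returns the n entity types associated with the most entities."""
    counts = Counter(t for types in entity_to_types.values()
                     for t in set(types))   # one vote per entity per type
    return [t for t, _ in counts.most_common(n)]
```

For the example above, the counts "star": 50, "internet celebrity": 40, "athlete": 20, "scientist": 5 with n = 3 would keep "star", "internet celebrity", and "athlete".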
Step 320: Based on the word vector and the associated entity types of each to-be-processed hypernym, the terminal device calculates the word-vector similarity and the entity-type similarity between every two to-be-processed hypernyms.
Taking any pair of to-be-processed hypernyms as an example, hereinafter hypernym x and hypernym y:

First, the word-vector similarity between the word vector corresponding to hypernym x and the word vector corresponding to hypernym y can be calculated, denoted sim1.

Second, the entity-type similarity between the entity types corresponding to hypernym x and those corresponding to hypernym y can be calculated, denoted sim2.

Specifically, the entity types associated with hypernym x and those associated with hypernym y are determined first; if hypernym x and/or hypernym y is associated with at least two entity types, the entity-type similarity is calculated for every pair of entity types between hypernym x and hypernym y, and the highest similarity value is chosen as the final entity-type similarity.
For example, if the entity type associated with hypernym x is "film and TV star", and the entity types associated with hypernym y are "film and TV star" and "singer", the following two entity-type similarities are calculated:

hypernym x "film and TV star" & hypernym y "film and TV star" = 100%

hypernym x "film and TV star" & hypernym y "singer" = 40%

Then 100% can be taken as the final entity-type similarity between hypernym x and hypernym y.
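Step 320 can be sketched as follows. The patent does not fix the word-vector similarity measure, so cosine similarity is assumed here; the rule for sim2 (take the maximum over all type pairs) follows the text, while the per-pair type similarity `type_sim` is taken as given:

```python
import math

def sim1(u, v):
    """Word-vector similarity between two hypernym word vectors
    (cosine similarity is an assumed choice)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sim2(types_x, types_y, type_sim):
    """Entity-type similarity: the highest value over all pairs of
    entity types of the two hypernyms."""
    return max(type_sim(a, b) for a in types_x for b in types_y)
```

With type_sim("film and TV star", "film and TV star") = 1.0 and type_sim("film and TV star", "singer") = 0.4, sim2 returns 1.0, matching the example above.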
Step 330: When the word-vector similarity reaches the first preset threshold and the entity-type similarity reaches the second preset threshold, the terminal device aggregates the corresponding to-be-processed hypernyms.

Again taking any pair of to-be-processed hypernyms as an example, hereinafter hypernym x and hypernym y: specifically, if sim1 and sim2 between hypernym x and hypernym y satisfy the following conditions, hypernym x and hypernym y are characterized as most-similar hypernyms and can be aggregated:

sim1 ≥ T1

sim2 ≥ T2

where T1 is the first preset threshold and T2 is the second preset threshold; T1 and T2 can be configured by operations staff according to practical working experience, and details are not described here.
Steps 300-330 above describe only one round of aggregation; the terminal device can apply this scheme repeatedly, searching the to-be-processed hypernyms (which may include hypernyms produced by earlier aggregation) for hypernyms with identical semantics and aggregating them round by round, so as to finally obtain the most accurately aggregated hypernyms.

After the aggregation processing, the terminal device can treat each to-be-processed hypernym for which no most-similar hypernym is found as a class of its own, and merge each to-be-processed hypernym that does find a most-similar hypernym with that hypernym into one class, finally obtaining the aggregated hypernym of each class; these hypernyms have been precisely screened and aggregated, with redundant data eliminated.
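One round of the pairwise screening and class formation described above can be sketched with a union-find grouping; union-find itself is an implementation choice, not something mandated by the text:

```python
def aggregate(hypernyms, sim1_fn, sim2_fn, t1, t2):
    """Put every pair whose word-vector similarity reaches t1 AND whose
    entity-type similarity reaches t2 into the same class; hypernyms with
    no most-similar partner end up as singleton classes."""
    parent = list(range(len(hypernyms)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a

    for i in range(len(hypernyms)):
        for j in range(i + 1, len(hypernyms)):
            if sim1_fn(hypernyms[i], hypernyms[j]) >= t1 and \
               sim2_fn(hypernyms[i], hypernyms[j]) >= t2:
                parent[find(i)] = find(j)   # merge the two classes

    groups = {}
    for i, h in enumerate(hypernyms):
        groups.setdefault(find(i), []).append(h)
    return list(groups.values())
```

Running this repeatedly on the resulting class representatives would correspond to the multi-round aggregation described above.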
Further, to improve aggregation accuracy, optionally, an additional aggregation-accuracy judgment can be performed on the to-be-processed hypernyms after aggregation. Specifically, the terminal device can perform the following operations for each group of aggregated to-be-processed hypernyms:

a) The terminal device determines the similar text parts between the to-be-processed hypernyms in the group.

Of course, before determining the similar text parts, the terminal device can optionally first remove the stop words and meaningless words in each to-be-processed hypernym, e.g., rare characters, auxiliary words of mood, and the like. Stop words are given by a stop-word dictionary, and meaningless words are words contained in the to-be-processed hypernyms that carry no practical meaning.

Then the terminal device can search for the similar text parts between the to-be-processed hypernyms. For example, in "most popular singer" and "singer with the most popularity", "singer" and "singer" can be regarded as a similar text part, and "most" and "most" can also be regarded as one.

b) The terminal device deletes the similar text parts between the to-be-processed hypernyms.

After "most" and "most", and "singer" and "singer", are deleted, the remaining text parts are "popular" and "popularity".

c) The terminal device calculates the semantic similarity of the remaining text parts between the to-be-processed hypernyms and the average number of characters the remaining text parts contain.

d) When it is determined that the semantic similarity of the remaining text parts reaches a third preset threshold and the average number of characters they contain is below a fourth preset threshold, or when the remaining text parts are empty, the aggregation performed for the group of to-be-processed hypernyms is judged valid.

Since "popular" and "popularity" are close in meaning, and the remaining text parts contain an average of only 2 characters, below the fourth preset threshold of 2.2, the merge is determined to be valid; i.e., the two to-be-processed hypernyms "most popular singer" and "singer with the most popularity" can be aggregated.
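The validity check in steps a)-d) can be sketched as follows, taking the remaining text parts (after removal of the similar parts, stop words, and meaningless words) as input. The semantic-similarity function and the threshold defaults are assumptions, with 2.2 taken from the example above:

```python
def aggregation_is_valid(remainders, sem_sim, t3=0.8, t4=2.2):
    """Step d): the merge is valid if the remaining text parts are all
    empty, or if their pairwise semantic similarity reaches t3 and their
    average character count is below t4. Assumes a group of at least two
    hypernyms; sem_sim(a, b) and the defaults t3/t4 are illustrative."""
    if all(r == "" for r in remainders):
        return True
    avg_len = sum(len(r) for r in remainders) / len(remainders)
    min_sim = min(sem_sim(a, b)
                  for i, a in enumerate(remainders)
                  for b in remainders[i + 1:])
    return min_sim >= t3 and avg_len < t4
```

Requiring every pairwise similarity (the minimum) to reach t3 is one reading of "the semantic similarity of the remaining text parts"; an average over pairs would be another.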
The above process is described in further detail below through two embodiments.
Embodiment 1:

The to-be-processed hypernyms are: "works of poet Chen Xiangyan", "works of poet Mei Shaojing", "works of poet Zhao Gong", "works of poet Lu Zhi", and "works of poet Wang Yi".

Although these to-be-processed hypernyms look similar, their actual key information is inconsistent. After the similar text parts are removed, the remaining text parts are: Chen Xiangyan, Mei Shaojing, Zhao Gong, Lu Zhi, Wang Yi.

The semantic similarity between these remaining text parts is below the third preset threshold, and the average number of characters they contain is about 2.4, above the fourth preset threshold of 2.2; this characterizes the aggregation as invalid, i.e., these to-be-processed hypernyms cannot be aggregated.
Embodiment 2:

The to-be-processed hypernyms are: "simple home-style steamed dumplings", "home-style steamed dumplings", and "steamed dumplings". After the similar text parts, stop words, and meaningless words are removed, the remaining text parts are: NULL, NULL, and NULL.

Since the remaining text parts are all empty, the aggregation is characterized as valid, i.e., these to-be-processed hypernyms can be merged.
Further, after it is determined that a group of to-be-processed hypernyms can be aggregated, the maximal common substring among the to-be-processed hypernyms in the group can be used as the name of the group after aggregation.

For example, for "simple home-style steamed dumplings", "home-style steamed dumplings", and "steamed dumplings", the maximal common substring is "steamed dumplings", so "steamed dumplings" can be used to name the aggregated hypernym; this also effectively improves query efficiency during retrieval and use.
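Finding the maximal common substring to name an aggregated group can be sketched by brute force over substrings of the shortest member, which is adequate for short hypernym texts (the function name is illustrative):

```python
def longest_common_substring(strings):
    """Name for an aggregated group: the longest substring shared by
    every hypernym in the group, or "" if none exists."""
    base = min(strings, key=len)                     # shortest member
    for length in range(len(base), 0, -1):           # longest candidates first
        for start in range(len(base) - length + 1):
            cand = base[start:start + length]
            if all(cand in s for s in strings):
                return cand
    return ""
```

For the dumpling example above, this returns "steamed dumplings", the name the text assigns to the aggregated hypernym.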
Based on the above embodiments, as shown in FIG. 5, in an embodiment of the present invention the terminal device includes at least a first determination unit 51, a second determination unit 52, a computing unit 53, and an aggregation unit 54, wherein:
the first determination unit 51 is configured to obtain multiple pending hypernyms, determine the word vector of each word contained in each pending hypernym, and calculate, based on the obtained word vectors, the term vector of each pending hypernym according to a special algorithm;
the second determination unit 52 is configured to determine the entity type associated with each pending hypernym in the knowledge graph;
the computing unit 53 is configured to calculate, based on the term vector and the associated entity type of each pending hypernym, the term vector similarity and the entity type similarity between every two pending hypernyms; and
the aggregation unit 54 is configured to aggregate the corresponding pending hypernyms when the term vector similarity reaches a first pre-determined threshold and the entity type similarity reaches a second pre-determined threshold.
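The dual-threshold decision made by the computing unit and the aggregation unit can be sketched as follows. The cosine measure and the threshold values 0.8 and 0.7 are illustrative assumptions; the patent specifies only that both similarities must reach their respective pre-determined thresholds:

```python
import math

def cosine_similarity(u, v):
    """Illustrative term vector similarity: cosine of the angle between vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def should_aggregate(term_vec1, term_vec2, entity_type_sim, t1=0.8, t2=0.7):
    """Aggregate two pending hypernyms only when BOTH the term vector
    similarity and the entity type similarity reach their thresholds."""
    return cosine_similarity(term_vec1, term_vec2) >= t1 and entity_type_sim >= t2
```

Note the conjunction: a pair that passes only one of the two tests is not aggregated, which is what filters out texts that are lexically close but typologically unrelated (or vice versa).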
Optionally, the special algorithm used by the first determination unit 51 is a density interpolation vectorization (DIE) algorithm.
When calculating, based on the obtained word vectors, the term vector of each pending hypernym according to the special algorithm, the first determination unit 51 is configured to:
determine at least two preset subregions corresponding to the pending hypernym according to the dimension of the pending hypernym, wherein each subregion corresponds to a part of the dimensions of the term vector;
calculate the region feature of each subregion based on the word vectors corresponding to the pending hypernym; and
calculate the term vector of the pending hypernym based on the obtained region features.
Optionally, when calculating the region feature of each subregion based on the word vectors corresponding to the pending hypernym, the first determination unit 51 is configured to perform the following operations for each subregion:
determine the weight of each word vector in the subregion based on the preset number of subregions and the number of word vectors contained in the pending hypernym; and
calculate the region feature of the pending hypernym in the subregion according to the word vectors in the subregion and their respective weights.
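The subregion scheme can be sketched as below. The patent does not disclose the DIE weight formula, so the position-based interpolation weights here are purely an illustrative assumption: each of the `num_subregions` subregions weights the word vectors differently, and the term vector is the concatenation of the per-subregion features:

```python
def term_vector(word_vectors, num_subregions=2):
    """Concatenate one weighted average of the word vectors per subregion.
    The weighting below (each subregion 'focuses' on a different relative
    word position) is an illustrative stand-in for the DIE weights."""
    n = len(word_vectors)
    dim = len(word_vectors[0])
    term_vec = []
    for r in range(num_subregions):
        focus = r / max(num_subregions - 1, 1)
        # Words near the subregion's focus position get higher weight.
        weights = [1.0 - 0.5 * abs(i / max(n - 1, 1) - focus) for i in range(n)]
        total = sum(weights)
        feature = [sum(w * vec[d] for w, vec in zip(weights, word_vectors)) / total
                   for d in range(dim)]
        term_vec.extend(feature)  # each subregion fills part of the dimensions
    return term_vec
```

Because the weights depend only on the number of subregions and the number of word vectors, hypernyms of different lengths still map to term vectors of the same dimensionality (`num_subregions * dim`), which is what makes the later pairwise similarity computation possible without word segmentation.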
When determining the entity type associated with a pending hypernym in the knowledge graph, the second determination unit 52 is configured to:
determine all entities corresponding to the pending hypernym in the knowledge graph;
determine the entity types respectively associated with these entities; and
select the N entity types associated with the largest numbers of entities as the entity types associated with the pending hypernym, wherein N is a preset natural number and N >= 1.
Optionally, when calculating the entity type similarity between every two pending hypernyms, the computing unit 53 is configured to:
determine the entity types associated with the first hypernym and the entity types associated with the second hypernym of the two pending hypernyms;
if the first hypernym and/or the second hypernym is associated with at least two entity types, calculate the entity type similarity of every pair of entity types between the first hypernym and the second hypernym; and
select the highest similarity value as the final entity type similarity.
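The pairwise maximum over entity types can be sketched as follows. `type_similarity` stands in for whatever type-to-type measure the system uses; the token-overlap (Jaccard) placeholder over slash-separated type paths is only an illustrative assumption:

```python
def type_similarity(t1, t2):
    """Illustrative placeholder: Jaccard overlap of type-path segments."""
    a, b = set(t1.split("/")), set(t2.split("/"))
    return len(a & b) / len(a | b)

def entity_type_similarity(types1, types2):
    """Final entity type similarity: the maximum over every pair of
    entity types associated with the two pending hypernyms."""
    return max(type_similarity(a, b) for a in types1 for b in types2)

print(entity_type_similarity(["food/dish", "food/snack"], ["food/dish"]))  # 1.0
```

Taking the maximum rather than the average means two hypernyms count as type-compatible as soon as any one of their associated types matches well, which is the lenient behavior the claim describes.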
After aggregating a group of pending hypernyms, the aggregation unit 54 is further configured to:
determine the similar text portions among the pending hypernyms in the group;
delete the similar text portions;
calculate the semantic similarity between the remaining text portions of the pending hypernyms and the average number of characters the remaining text portions contain; and
determine that the aggregation performed on the group of pending hypernyms is valid when the semantic similarity of the remaining text portions reaches the third preset threshold and the average number of characters of the remaining text portions is less than the fourth preset threshold, or when the remaining text portions are empty.
Optionally, before determining the similar text portions among the pending hypernyms in the group, the aggregation unit 54 is further configured to:
remove preset stop words and redundant words from each pending hypernym.
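The validation flow above (strip stop words, delete the shared portion, then test the remainders) can be sketched as follows. The semantic-similarity function is left as a parameter; the stop word set and both thresholds are illustrative assumptions, and remainder length is counted in tokens here whereas the patent's examples count characters:

```python
def aggregation_is_valid(hypernyms, stop_words=frozenset(),
                         semantic_sim=lambda a, b: 0.0,
                         sim_threshold=0.8, avg_len_threshold=2.2):
    """Validate an aggregated group per the flow described above."""
    # 1. Remove preset stop words / redundant words.
    token_lists = [[t for t in h.split() if t not in stop_words]
                   for h in hypernyms]
    # 2. Delete the similar text portion shared by every hypernym.
    shared = set.intersection(*(set(tokens) for tokens in token_lists))
    remainders = [[t for t in tokens if t not in shared]
                  for tokens in token_lists]
    # 3. All remainders empty (the NULL case): aggregation is valid.
    if all(not r for r in remainders):
        return True
    # 4. Otherwise require similar, short remainders.
    avg_len = sum(len(r) for r in remainders) / len(remainders)
    pairs = [(a, b) for i, a in enumerate(remainders)
             for b in remainders[i + 1:]]
    min_sim = min(semantic_sim(a, b) for a, b in pairs)
    return min_sim >= sim_threshold and avg_len < avg_len_threshold
```

On the steamed-dumpling example every remainder is empty after removing the shared "steamed dumpling" portion, so the aggregation is accepted without ever calling the semantic-similarity function; on the Embodiment 1 example the leftover person names are dissimilar and too long, so it is rejected.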
The aggregation unit 54 is further configured to:
use the maximum common character string among the pending hypernyms in the aggregated group as the name of the group after aggregation.
Based on the same inventive concept, an embodiment of the present invention provides a storage medium storing a program for implementing the hypernym aggregation method. When the program is run by a processor, the following steps are executed:
obtaining multiple pending hypernyms, and determining the word vector of each word contained in each pending hypernym;
calculating, based on the obtained word vectors, the term vector of each pending hypernym according to a special algorithm;
determining the entity type associated with each pending hypernym in the knowledge graph;
calculating, based on the term vector and the associated entity type of each pending hypernym, the term vector similarity and the entity type similarity between every two pending hypernyms; and
aggregating the corresponding pending hypernyms when the term vector similarity reaches the first pre-determined threshold and the entity type similarity reaches the second pre-determined threshold.
As shown in FIG. 6, based on the same inventive concept, an embodiment of the present invention provides a computer apparatus, including one or more processors 60 and one or more computer-readable media 61 on which instructions are stored. When the instructions are executed by the one or more processors 60, the computer apparatus executes any one of the methods introduced in the above embodiments.
In conclusion in the embodiment of the present invention, terminal device is calculated according to the word vector that each hypernym includes on each
Term vector similarity between the word of position, and entity type associated by the corresponding entity of each hypernym calculate on each
Entity type similarity between the word of position, and term vector similarity is reached into the first pre-determined threshold and entity type similarity reaches
Each hypernym of second pre-determined threshold is polymerize;Since hypernym is typically to be made of a little several words, using traditional
Participle operation can bring larger error and information loss, therefore, in the embodiment of the present invention, the word that includes based on hypernym to
The characterized term vector of amount and judged to carry out the similarity between hypernym based on the associated entity type of hypernym, it can be with
Short text as similar hypernym is effectively treated, not only can effectively excavate the text key information that hypernym includes, but also can
Accurately to depict the type feature of hypernym, at the same not only can to avoid the miscellaneous work amount of artificial design features, but also
The generalization ability that model can be enhanced efficiently identifies invalid hypernym, removes the redundant data in hypernym, significantly improves
The accuracy of hypernym polymerization.
It should be understood by those skilled in the art that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, once persons skilled in the art learn of the basic inventive concept, additional changes and modifications may be made to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make various modifications and variations to the embodiments of the present invention without departing from the spirit and scope of the embodiments of the present invention. Thus, if these modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.
Claims (12)
1. A hypernym aggregation method, characterized in that it comprises:
obtaining multiple pending hypernyms, and determining the word vector of each word contained in each pending hypernym;
calculating, based on the obtained word vectors, the term vector of each pending hypernym according to a special algorithm;
determining the entity type associated with each pending hypernym in a knowledge graph;
calculating, based on the term vector and the associated entity type of each pending hypernym, the term vector similarity and the entity type similarity between at least every two pending hypernyms; and
aggregating the corresponding pending hypernyms when the term vector similarity reaches a first pre-determined threshold and the entity type similarity reaches a second pre-determined threshold.
2. The method as described in claim 1, characterized in that the special algorithm is a density interpolation vectorization (DIE) algorithm.
3. The method as described in claim 1, characterized in that calculating, based on the obtained word vectors, the term vector of each pending hypernym according to the special algorithm comprises:
determining at least two preset subregions corresponding to the pending hypernym according to the dimension of the pending hypernym, wherein each subregion corresponds to a part of the dimensions of the term vector;
calculating the region feature of each subregion based on the word vectors corresponding to the pending hypernym; and
calculating the term vector of the pending hypernym based on the obtained region features.
4. The method as described in claim 3, characterized in that calculating the region feature of each subregion based on the word vectors corresponding to the pending hypernym comprises performing the following operations for each subregion:
determining the weight of each word vector in the subregion based on the preset number of subregions and the number of word vectors contained in the pending hypernym; and
calculating the region feature of the pending hypernym in the subregion according to the word vectors in the subregion and their respective weights.
5. The method as described in claim 1, characterized in that determining the entity type associated with a pending hypernym in the knowledge graph comprises:
determining all entities corresponding to the pending hypernym in the knowledge graph;
determining the entity types respectively associated with these entities; and
selecting the N entity types associated with the largest numbers of entities as the entity types associated with the pending hypernym, wherein N is a preset natural number and N >= 1.
6. The method as described in claim 5, characterized in that calculating the entity type similarity between every two pending hypernyms comprises:
determining the entity types associated with the first hypernym and the entity types associated with the second hypernym of the two pending hypernyms;
if the first hypernym and/or the second hypernym is associated with at least two entity types, calculating the entity type similarity of every pair of entity types between the first hypernym and the second hypernym; and
selecting the highest similarity value as the final entity type similarity.
7. The method as described in any one of claims 1 to 6, characterized in that after aggregating a group of pending hypernyms, the method further comprises:
determining the similar text portions among the pending hypernyms in the group;
deleting the similar text portions;
calculating the semantic similarity between the remaining text portions of the pending hypernyms and the average number of characters the remaining text portions contain; and
determining that the aggregation performed on the group of pending hypernyms is valid when the semantic similarity of the remaining text portions reaches a third preset threshold and the average number of characters of the remaining text portions is less than a fourth preset threshold, or when the remaining text portions are empty.
8. The method as described in claim 7, characterized in that before determining the similar text portions among the pending hypernyms in the group, the method further comprises:
removing preset stop words and redundant words from each pending hypernym.
9. The method as described in claim 7, characterized in that it further comprises:
using the maximum common character string among the pending hypernyms in the aggregated group as the name of the group after aggregation.
10. A hypernym aggregation apparatus, characterized in that it comprises:
a first determination unit, configured to obtain multiple pending hypernyms, determine the word vector of each word contained in each pending hypernym, and calculate, based on the obtained word vectors, the term vector of each pending hypernym according to a special algorithm;
a second determination unit, configured to determine the entity type associated with each pending hypernym in a knowledge graph;
a computing unit, configured to calculate, based on the term vector and the associated entity type of each pending hypernym, the term vector similarity and the entity type similarity between every two pending hypernyms; and
an aggregation unit, configured to aggregate the corresponding pending hypernyms when the term vector similarity reaches a first pre-determined threshold and the entity type similarity reaches a second pre-determined threshold.
11. A storage medium, characterized in that it stores a program for implementing the hypernym aggregation method, and when the program is run by a processor, the following steps are executed:
obtaining multiple pending hypernyms, and determining the word vector of each word contained in each pending hypernym;
calculating, based on the obtained word vectors, the term vector of each pending hypernym according to a special algorithm;
determining the entity type associated with each pending hypernym in a knowledge graph;
calculating, based on the term vector and the associated entity type of each pending hypernym, the term vector similarity and the entity type similarity between every two pending hypernyms; and
aggregating the corresponding pending hypernyms when the term vector similarity reaches a first pre-determined threshold and the entity type similarity reaches a second pre-determined threshold.
12. A computer apparatus, characterized in that it comprises one or more processors and one or more computer-readable media on which instructions are stored, and when the instructions are executed by the one or more processors, the apparatus executes the method as described in any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810100677.0A CN108415950B (en) | 2018-02-01 | 2018-02-01 | Hypernym aggregation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108415950A true CN108415950A (en) | 2018-08-17 |
CN108415950B CN108415950B (en) | 2021-03-23 |
Family
ID=63126797
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7251637B1 (en) * | 1993-09-20 | 2007-07-31 | Fair Isaac Corporation | Context vector generation and retrieval |
CN103559234A (en) * | 2013-10-24 | 2014-02-05 | 北京邮电大学 | System and method for automated semantic annotation of RESTful Web services |
CN104484461A (en) * | 2014-12-29 | 2015-04-01 | 北京奇虎科技有限公司 | Method and system based on encyclopedia data for classifying entities |
CN106372118A (en) * | 2016-08-24 | 2017-02-01 | 武汉烽火普天信息技术有限公司 | Large-scale media text data-oriented online semantic comprehension search system and method |
CN106844658A (en) * | 2017-01-23 | 2017-06-13 | 中山大学 | A kind of Chinese text knowledge mapping method for auto constructing and system |
CN106919577A (en) * | 2015-12-24 | 2017-07-04 | 北京奇虎科技有限公司 | Based on method, device and search engine that search word scans for recommending |
CN107644014A (en) * | 2017-09-25 | 2018-01-30 | 南京安链数据科技有限公司 | A kind of name entity recognition method based on two-way LSTM and CRF |
Non-Patent Citations (1)
Title |
---|
MOHAMMAD TAHER PILEHVAR et al.: "From senses to texts: An all-in-one graph-based approach for measuring semantic similarity", Artificial Intelligence |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110008972A (en) * | 2018-11-15 | 2019-07-12 | 阿里巴巴集团控股有限公司 | Method and apparatus for data enhancing |
CN110008972B (en) * | 2018-11-15 | 2023-06-06 | 创新先进技术有限公司 | Method and apparatus for data enhancement |
CN109829041A (en) * | 2018-12-25 | 2019-05-31 | 出门问问信息科技有限公司 | Question processing method and device, computer equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||