CN102955837A - Analogy retrieval control method based on Chinese word pair relationship similarity - Google Patents

Analogy retrieval control method based on Chinese word pair relationship similarity

Info

Publication number
CN102955837A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011104154039A
Other languages
Chinese (zh)
Inventor
吕钊
梁超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an analogy retrieval control method based on Chinese word pair relationship similarity, used to obtain target words by keyword retrieval. The method comprises the following steps: a, obtaining a word pair; b, extracting short sentences that contain the word pair from the retrieval result; c, extracting a word pair relation pattern set from the set of short sentences that contain the word pair; d, performing a first clustering on a first relation word set in the word pair relation pattern set to obtain a second relation word set; e, performing a second clustering on the second relation word set and using the result as a first intermediate relation word set; g, pairing the relation words in the first intermediate relation word set with the keyword, one by one, to form first word pairs, and repeating steps a to e to obtain second intermediate relation word sets; and h, using each second intermediate relation word set as a target word set. The method is based on statistics over large-scale text: when the relationship between entities is unknown, it finds the multiple relationships that exist within an entity pair and then finds the candidate items corresponding to each relationship.

Description

An analogy retrieval control method based on Chinese word pair relationship similarity
Technical field
The present invention relates to Chinese word pair relationship similarity and to the technical field of information retrieval, and specifically to an analogy retrieval technique based on Chinese word pair relationship similarity.
Background art
With the sustained development of the World Wide Web and the continuous progress of search engines, web search has become easier and easier. The first generation of search engines, represented by Yahoo, offered site search through manually classified directory navigation and opened the era of Internet search. The second generation, represented by Google, searches by keyword and specific algorithms: it relies on machine crawling and performs large-scale web page search built on hyperlink analysis, and the precision of its results has been raised from the website level to the web page level. Current search engines still have problems: a single search engine cannot cover all Internet resources, search is not accurate enough, and results do not truly reflect user intent. The goal of the next generation of search engines is to imitate human thinking more closely and to search at the level of concepts. By analysing the associations between web pages, such an engine builds a more intelligent concept classification scheme that resembles human thought and, by imitating human thought patterns, associates and classifies the concepts behind the search keywords so as to broaden and deepen the search.
The present invention proposes a new retrieval control method.
Summary of the invention
In view of the defects in the prior art, the object of the present invention is to provide an analogy retrieval control method based on Chinese word pair relationship similarity.
According to one aspect of the present invention, an analogy retrieval control method based on Chinese word pair relationship similarity is provided, which is used to obtain at least one target word by retrieval based on at least one keyword, characterized in that it comprises the following steps: a. obtaining a word pair, wherein the word pair is a pair of words having the same relationship as the keyword and the target word; b. extracting, from the retrieval result, short sentences that contain the word pair, wherein a short sentence is a complete sentence that contains both words of the word pair; c. extracting a word pair relation pattern set from the set of short sentences that contain the word pair; d. performing a first clustering on a first relation word set in the word pair relation pattern set to obtain a second relation word set; e. performing a second clustering on the second relation word set, and taking the result of the second clustering as a first intermediate relation word set; g. pairing the relation words in the first intermediate relation word set with the keyword, one by one, to form first word pairs, and repeating steps a to e for each first word pair to obtain a second intermediate relation word set corresponding to that first word pair, wherein a relation word is at least one word in a relation pattern other than the words of the word pair; h. taking each second intermediate relation word set as a target word set, wherein the relation words in each second intermediate relation word set correspond to the target word set, and the fourth relation word set and the second intermediate relation word set form a two-dimensional result set.
Preferably, between step e and step g the method further comprises the step of: f. performing a third clustering on the first intermediate relation word set, and taking the result of the third clustering as the first intermediate relation word set, wherein in step g steps a to f are repeated for each first word pair.
Preferably, step a comprises the step of: a'. retrieving the word pair in a search engine.
Preferably, step a comprises the step of: a1. extracting, entry by entry, the titles in the retrieval result for the word pair.
Preferably, step c comprises the steps of: c1. extracting the relation pattern of each short sentence in the set of short sentences that contain the word pair; c2. grouping the relation patterns according to relational models to form the word pair relation pattern set.
Preferably, step c1 further comprises the steps of: c11. segmenting each short sentence in the set of short sentences that contain the word pair into words with independent meaning; c12. tagging each such word in each short sentence with its part of speech; c13. extracting from each short sentence the words whose part of speech is noun or verb; c14. taking the combination of the words extracted from each short sentence as the relation pattern of that short sentence.
Preferably, step c2 further comprises the steps of: c21. matching the relation patterns against the relational models and placing the relation patterns that match the same relational model in one group; c22. merging identical relation patterns within each group and accumulating their frequencies; c23. computing the similarity between different relation patterns within each group; c24. merging the relation patterns whose similarity exceeds a first threshold and accumulating their frequencies; c25. taking all relation patterns that result from the above merging operations as the word pair relation pattern set, wherein each word pair relation pattern has a corresponding frequency value.
Preferably, step d comprises the steps of: d1. extracting the first relation word set from the word pair relation pattern set; d2. performing the first clustering on the first relation word set to obtain the second relation word set.
Preferably, step d1 further comprises the steps of: d11. extracting the relation words from each word pair relation pattern in the word pair relation pattern set, wherein a relation word is a word in the word pair relation pattern other than the words of the word pair; d12. taking all the relation words as the first relation word set, wherein each relation word has a corresponding frequency value, the frequency value being the frequency of the word pair relation pattern in which the relation word occurs.
Preferably, step d2 further comprises the steps of: d21. merging identical relation words in the first relation word set and accumulating the frequency values of the merged relation words; d22. sorting the merged relation words by frequency value; d23. taking the sorted relation words as the second relation word set.
Preferably, step e comprises the steps of: e1. grouping the relation words in the second relation word set; e2. taking the relation word with the highest frequency value in each group as a candidate word; e3. taking the set of candidate words selected from the groups as the first intermediate relation word set.
Preferably, step e1 further comprises the steps of: e11. taking the relation word with the highest frequency value in the second relation word set as the center word; e12. computing the similarity between the center word and every other relation word in the second relation word set; e13. placing the relation words with the same similarity in one group.
Preferably, step f comprises the steps of: f1. computing the pairwise similarity of all relation words in the first intermediate relation word set; f2. merging the relation words whose similarity exceeds a second threshold and accumulating the frequency values of the merged relation words; f3. taking the merged set of relation words as the second intermediate relation word set.
Preferably, before step g the method further comprises the steps of: i1. judging whether the second intermediate relation word set is the target word set; i2. if the second intermediate relation word set is not the target word set, continuing with step g.
Preferably, after step i2 the method further comprises the step of: i3. if the second intermediate relation word set is the target word set, executing step h.
The present invention expands a search keyword by analogy, based on the relational similarity between word pairs. It assumes that information in an unknown domain resembles information in a known domain in its form of expression, so that relevant information about the unknown domain can be inferred by comparing the relational similarity between the known-domain information and the unknown-domain information. For example, a user of one brand's product may want to search for the products of other brands without knowing the name of the desired product or the keywords that describe it. The user does, however, know the familiar brand's product: how it works, what it does and where it is used. This knowledge is an important clue for finding the other brands' products. Specifically, most users know the iPod, a music player sold by Apple. To find the related product from Microsoft, they look for the analogous relationship between the iPod and the music player sold by Microsoft. More concretely, given a tuple of three terms, for example (Apple, iPod, Microsoft), the present invention can find Zune. Those skilled in the art will understand that the iPod is Apple's music player and Zune is Microsoft's music player.
Latent relation search is a novel search mode that retrieves by the degree of analogy between Chinese word pair relationships. It allows a user to obtain the needed information effectively even in a domain unknown to the user. The method adopted by the present invention is based on statistics over large-scale text: when the relationship between entities is unknown, it can find the multiple relationships that exist within an entity pair and then find the candidate items corresponding to each relationship.
Description of drawings
Other features, objects and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is a schematic diagram of the control method for obtaining the intermediate relation words;
Fig. 2 is a schematic diagram of the 18 relational models;
Fig. 3 is a flowchart of the analogy retrieval method based on Chinese word pair relationship similarity according to a first embodiment of the present invention;
Fig. 4 is a flowchart of extracting word pair relationships according to the first embodiment of the present invention; and
Fig. 5 is a flowchart of the three clusterings according to the first embodiment of the present invention.
Embodiment
Fig. 1 shows the method for obtaining the intermediate relation words, taking a query as an example. Specifically, the figure shows the six processing modules provided by the present invention; the word pair that the user submits for retrieval passes through these six modules to yield the intermediate relation word set. Those skilled in the art will understand that the word pair is a pair of words having the same relationship as the keyword and the target word. For example, a user wants to retrieve the target word "Microsoft" from the keyword "SQL Server 2008". Because SQL Server 2008 is a relational database management system from Microsoft, and the user knows that MySQL is a relational database management system from Oracle, the user can use (MySQL, Oracle) as the word pair. More specifically, the six processing modules are the preprocessing module, the short sentence extraction module, the relation pattern extraction module, the first clustering module, the second clustering module and the third clustering module. The word pair is first fed to the preprocessing module, which submits it to an existing search engine such as Google, Bing, Baidu or Wikipedia. A series of sentences containing the word pair can be obtained from the returned result pages. The short sentence extraction module extracts the short sentence set from them. The relation pattern extraction module then matches and extracts the word pair relation pattern set from the short sentence set. The first clustering module clusters the first relation word set in the word pair relation pattern set to obtain the second relation word set, sorted by frequency. The second clustering module performs the second clustering on the relation words in the second relation word set to obtain the first intermediate relation word set: the second relation word set is grouped by similarity calculation, and the relation word with the highest frequency in each group is selected. The third clustering module then clusters the first intermediate relation word set to obtain the second intermediate relation word set. More specifically, those skilled in the art will understand that, once the second intermediate relation word set has been obtained, each relation word in it is combined with the keyword to form another word pair, which is again submitted to the search engine and processed by the six modules. Extraction and clustering then yield a more accurate target word set corresponding to each relation word. The method of obtaining the target word D from the input in the figure is the same as described above and is not repeated here.
 
Fig. 2 shows the 18 relational models. Specifically, the 18 relational models shown in the figure are nvXY, XnvY, XYnv, nXvY, nXYv, XnYv, nXY, XnY, XYn, vnXY, XvnY, XYvn, vXnY, vXYn, XvYn, vXY, XvY and XYv, where n is a word whose part of speech is noun, v is a word whose part of speech is verb, and X and Y are the two words of the word pair. For example, if the word pair is iPod and Apple, then XY stands for iPod and Apple. Those skilled in the art will understand that the order of X and Y does not affect the implementation of the present invention, so XY and YX express the same thing; for example, the relational model nvXY and the relational model nvYX are the same relational model. The examples below are rendered word for word from the Chinese. When the word pair is iPod and Apple, the relational model nvXY consists of a noun, then a verb, then the word pair, so "software download Apple iPod" matches nvXY. The relational model XnvY consists of one word of the word pair, then a noun, then a verb, then the other word of the word pair, so "Apple new-product release iPod" matches XnvY. The relational model XYnv consists of the word pair, then a noun, then a verb, so "Apple iPod product sell" matches XYnv. The remaining 15 relational models are matched in the same way as these three and are not described further here.
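The model matching described for Fig. 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `match_model`, the `(word, tag)` token format and the tags `"n"` and `"v"` are assumptions made for the example. The first pair word encountered is labelled X and the second Y, which reflects the statement that the order of X and Y does not matter.

```python
# The 18 relational models listed for Fig. 2.
MODELS = {
    "nvXY", "XnvY", "XYnv", "nXvY", "nXYv", "XnYv", "nXY", "XnY", "XYn",
    "vnXY", "XvnY", "XYvn", "vXnY", "vXYn", "XvYn", "vXY", "XvY", "XYv",
}

def match_model(tokens, pair):
    """Return the relational model a tagged word combination matches, or None.

    tokens: list of (word, tag) tuples; tag "n" = noun, "v" = verb (assumed format).
    pair:   the two words of the word pair, in any order.
    """
    symbols = {}          # pair word -> "X" or "Y", in order of first appearance
    signature = []
    for word, tag in tokens:
        if word in pair:
            if word in symbols:
                return None   # a repeated pair word is not handled in this sketch
            symbols[word] = "XY"[len(symbols)]
            signature.append(symbols[word])
        elif tag in ("n", "v"):
            signature.append(tag)
    model = "".join(signature)
    return model if model in MODELS else None

pair = ("Apple", "iPod")
print(match_model([("software", "n"), ("download", "v"),
                   ("Apple", "n"), ("iPod", "n")], pair))
print(match_model([("Apple", "n"), ("new-product", "n"),
                   ("release", "v"), ("iPod", "n")], pair))
```

A combination that contains only the word pair, or more than one noun and one verb, matches none of the 18 models and returns `None` here; how the patent treats such combinations is not stated.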
 
Fig. 3 shows a flowchart of the analogy retrieval method based on Chinese word pair relationship similarity according to the first embodiment of the present invention. Specifically, the figure shows the whole flow from the user entering the word pair to the final two-dimensional result set, seven steps in total. First, in step 201, the preprocessing module provided by the technical solution of the present invention performs a retrieval with the word pair entered by the user and extracts the titles, entry by entry, from the search results returned by the search engine. The search engine is preferably Baidu. Those skilled in the art will understand that the word pair is another known pair of words, entered by the user, that has the same relationship as the target word and the known keyword. For example, a user wants to retrieve the target word "Microsoft" from the keyword "SQL Server 2008". Because SQL Server 2008 is a relational database management system from Microsoft, and the user knows that MySQL is a relational database management system from Oracle, the user can use (MySQL, Oracle) as the word pair. Next, in step 202, the short sentences containing the word pair are extracted. Specifically, each extracted title is checked for spaces or punctuation marks between the two words of the pair, and every title segment in which no space or punctuation mark separates the two words is taken as a short sentence; these short sentences form the short sentence set. This guarantees that the word pair appears within one complete sentence. For example (titles rendered word for word from the Chinese), when the word pair is "Apple" and "iPod" and the title is "Apple online _ China Apple portal _ Apple brand store Apple software download iphoneipod", the short sentence extracted from this title is "Apple brand store Apple software download iphoneipod". As another example, when the word pair is "Apple" and "iPod" and the title is "[iPod zone] Apple iPod complete collection _ Apple MP3 quotes - ZOL Zhongguancun Online", the short sentence extracted from this title is "Apple iPod complete collection". Step 202 is followed by step 203, in which each short sentence in the short sentence set undergoes word segmentation, model matching, frequency calculation and similar operations. The processed short sentences and their frequencies form the word pair relation pattern set. A word pair relation pattern is a combination of words that comprises the word pair and at least one noun or at least one verb. For example, when the word pair is "Apple" and "iPod", the word pair relation patterns may be "Apple iPod new-product sale" with a frequency value of 3 and "Apple new-product release iPod" with a frequency value of 5. The extraction of word pair relation patterns and the frequency calculation are described below and are not detailed here. After the word pair relation pattern set has been formed, step 204 extracts the relation words from the word pair relation patterns. Those skilled in the art will understand that the relation words are the words in a relation pattern other than the words of the word pair. For example, when the word pair is "Apple" and "iPod", the relation words of the pattern "Apple new-product release iPod" are "new-product" and "release"; since the frequency value of this pattern is 5, the frequency values of "new-product" and "release" are both 5. These relation words and their frequency values are added to the first relation word set. Likewise, the relation words of the pattern "Apple iPod new-product sale" are "new-product" and "sale"; since the frequency value of this pattern is 3, the frequency values of "new-product" and "sale" are both 3. These relation words and their frequency values are also added to the first relation word set. Once the first relation word set has been formed, duplicates in it are removed and their frequencies accumulated, and the relation words are sorted by frequency to form the second relation word set, which is the result of the first clustering. In this example the first relation word set contains "new-product" with frequency value 5, "release" with frequency value 5, "new-product" with frequency value 3 and "sale" with frequency value 3. The two occurrences of "new-product" are merged first, giving a frequency value of 8. Sorting by frequency value then yields the second relation word set: "new-product" with frequency value 8, "release" with frequency value 5 and "sale" with frequency value 3. Those skilled in the art will understand that relation words with identical frequency values may preferably be sorted by initial letter; their relative order does not affect the performance of the present invention. The first clustering is followed by the second clustering, step 205. The second clustering takes the relation word with the highest frequency in the second relation word set as the center word, for example "new-product" in the example above, computes the similarity between the center word and every other relation word in the set, places the relation words with the same similarity in one group, and extracts the relation word with the highest frequency value from each group to form the first intermediate relation word set. This is followed by the third clustering, step 206, which clusters further by computing the pairwise similarity of the relation words in that set: relation words whose similarity exceeds the second threshold are merged and added to a new relation word set. After all relation words in the set produced by the second clustering have been processed in this way, the new relation word set is the second intermediate relation word set, the result of the third clustering. After the three clusterings, step 207 judges whether the second intermediate relation word set obtained is the target word set. If it is not, step 208 is executed: each relation word in the relation word set obtained after the three clusterings is paired with the keyword to be retrieved to form a word pair, which is processed by steps 201 to 206 above, so that each such relation word yields one candidate word set. When step 207 judges that the candidate word set is the target word set, the final step 209 builds the two-dimensional result set from the relation word set and the target word sets and returns it to the user.
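The first clustering described above (extract relation words, merge duplicates, accumulate frequencies, sort) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `first_clustering` and the representation of a relation pattern as a tuple of words with a frequency are hypothetical, and ties are broken alphabetically as one possible reading of "sorted by initial letter".

```python
from collections import Counter

def first_clustering(patterns, pair):
    """Build the second relation word set from word pair relation patterns.

    patterns: iterable of (words, frequency), where words is a tuple of the
              words in one relation pattern (assumed representation).
    pair:     the two words of the word pair.
    Returns a list of (relation_word, frequency), highest frequency first.
    """
    freq = Counter()
    for words, pattern_freq in patterns:
        for word in words:
            if word not in pair:            # relation words exclude the pair itself
                freq[word] += pattern_freq  # each inherits its pattern's frequency
    # Sort by descending frequency; ties are broken alphabetically.
    return sorted(freq.items(), key=lambda item: (-item[1], item[0]))

patterns = [
    (("Apple", "new-product", "release", "iPod"), 5),
    (("Apple", "iPod", "new-product", "sale"), 3),
]
print(first_clustering(patterns, ("Apple", "iPod")))
# [('new-product', 8), ('release', 5), ('sale', 3)]
```

The output reproduces the worked example in the text: "new-product" accumulates 5 + 3 = 8, followed by "release" with 5 and "sale" with 3.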
 
Fig. 4 shows a flowchart of extracting word pair relationships according to the first embodiment of the present invention. Specifically, the figure shows the whole process by which the relation pattern extraction module provided by the present invention extracts the relation patterns of the input word pair and forms the relation pattern set, five steps in total. First, in step 231, each short sentence in the short sentence set is segmented with a Chinese word segmentation tool. The segmentation tool is preferably ICTCLAS, whose segmentation and part-of-speech tagging accuracy exceeds 95%. Segmentation divides the short sentence into words with independent meaning, each carrying a part-of-speech tag. Each short sentence in the set may contain words without semantic content, such as stop words and conjunctions; the present invention removes these meaningless words according to their part-of-speech tags. The method of the present invention extracts only nouns and verbs, which represent the backbone and meaning of the whole sentence. The extracted words form word combinations, which together make up the word combination set. Next, in step 232, the word combinations in this set are grouped by sentence pattern. Those skilled in the art will understand that, for better grouping, the present invention proposes a model comprising 18 patterns. As shown in Fig. 2, X denotes word A, Y denotes word B, n denotes a noun and v denotes a verb. Each word combination in the set is matched against the sentence patterns and assigned to the group that corresponds to the matching one of the 18 sentence patterns. Step 232 is followed by step 233: during clustering, the similarity between any two different word combinations assigned to the same group is computed. Those skilled in the art will understand that the similarity calculation uses the synonym thesaurus Tongyici Cilin. Next, in step 234, word combinations that, according to the similarity calculation, have the same sentence pattern and the same or similar content are merged during grouping, and their frequencies are accumulated and recorded. Finally, in step 235, the word combinations resulting from the above operations, together with their frequencies, are taken as the word pair relation pattern set.
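Steps 231 and 234 above can be sketched as follows for the simplest case, in which only identical word combinations are merged. This is a sketch under stated assumptions: the short sentences are taken as already segmented and tagged (the patent prefers ICTCLAS for that, which is not called here), the tags `"n"` and `"v"` and the function names are hypothetical, and merging of merely similar combinations via a thesaurus is left out.

```python
from collections import Counter

def extract_pattern(tagged_sentence, pair):
    """Keep the pair words plus every noun and verb, in sentence order.

    tagged_sentence: list of (word, tag) tuples from a segmenter
                     (tags "n" and "v" are an assumed convention).
    """
    return tuple(
        word for word, tag in tagged_sentence
        if word in pair or tag in ("n", "v")
    )

def count_patterns(tagged_sentences, pair):
    """Merge identical word combinations and accumulate their frequencies."""
    return Counter(extract_pattern(s, pair) for s in tagged_sentences)

pair = ("Apple", "iPod")
sentences = [
    [("Apple", "n"), ("new-product", "n"), ("release", "v"), ("iPod", "n")],
    [("Apple", "n"), ("and", "c"), ("new-product", "n"),
     ("release", "v"), ("iPod", "n")],
    [("Apple", "n"), ("iPod", "n"), ("new-product", "n"), ("sale", "n")],
]
for pattern, freq in count_patterns(sentences, pair).items():
    print(freq, pattern)
```

The first two sentences differ only by a conjunction, which is dropped, so they collapse into one word combination with frequency 2.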
 
Fig. 5 shows a flowchart of the three clusterings according to the first embodiment of the present invention. Specifically, the figure shows the whole process by which the first, second and third clusterings of the analogy retrieval method based on Chinese word pair relationship similarity produce the second intermediate relation word set, nine steps in total. First, in step 241, the relation words in each relation pattern of the word pair relation pattern set are extracted to obtain the first relation word set. Next, in step 242, duplicates in the first relation word set are removed: identical relation words are merged and the frequencies of the merged relation words are accumulated. In step 243, the deduplicated relation words are ranked by their frequencies, which yields the second relation word set. Next, in step 244, the present invention takes the top-ranked relation word of the second relation word set as the center word and computes the word similarity between the center word and every relation word in the second relation word set. Once the similarities have been computed, step 245 performs the second grouping according to similarity: relation words with the same similarity are assigned to one group, and the relation word with the highest frequency in each group is extracted as a candidate word. Next, in step 246, all the candidate words form the first intermediate relation word set. Step 246 is followed by step 247, which computes the pairwise similarity of the relation words in the first intermediate relation word set in order to cluster them further. Next, in step 248, if the computed similarity of two words exceeds the second threshold, the relation words are merged and added to a new relation word set. Finally, in step 249, the new relation word set forms the second intermediate relation word set.
 
More specifically, those skilled in the art will appreciate that, in a preferred embodiment, the control method of the invention can be implemented as follows:
Step 1: extract relation words. First, web pages are crawled and information is extracted. The invention uses Baidu as the search engine. When the word pair is entered into the search engine, a series of search results is returned, and these results are saved as the raw corpus. The titles are then extracted from the raw corpus entry by entry. To improve the accuracy of the candidate word D that is eventually found, a sufficiently large corpus needs to be collected.
Step 2: find the entries that contain A and B. The aim is to find sentences that contain the word pair. To find the syntactic patterns that express the semantic relation between the two words A and B, the invention matches short sentences against a template built from the symbols p and *, and each matched string is denoted t. In the template, p denotes a punctuation mark and * denotes any run of consecutive characters other than spaces and punctuation marks, so no space or punctuation mark occurs between word A and word B. Under this condition, the invention can guarantee that A and B appear within one complete sentence. After extraction, the invention obtains the set T of strings t.
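The template matching of step 2 might be sketched as follows. This is only an illustrative sketch under stated assumptions: the template is taken to have the form p*A*B*p with A preceding B, the start and end of the text are treated like punctuation, and the delimiter set PUNCT, the function name extract_short_sentences and the Chinese renderings of the names Yao Ming and Ye Li are all assumptions that do not come from the original text.

```python
import re

# Assumed delimiter set: whitespace plus common ASCII and Chinese punctuation.
PUNCT = r"\s,.;:!?()\[\]\-_/|，。；：！？、（）【】《》…"

def extract_short_sentences(text, a, b):
    """Return every delimiter-free run of characters that contains a followed by b."""
    run = f"[^{PUNCT}]*"  # plays the role of '*': no spaces, no punctuation
    pattern = re.compile(run + re.escape(a) + run + re.escape(b) + run)
    return [m.group(0) for m in pattern.finditer(text)]

# A search result title split by '_' and a space (assumed Chinese original).
title = "姚明叶莉爱情童话_在线视频观看_土豆网视频 姚明叶莉"
print(extract_short_sentences(title, "姚明", "叶莉"))
```

Because every wildcard excludes spaces and punctuation, a match can never cross a sentence boundary, which is the property the step relies on.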
Step 3: word segmentation, trunk extraction and clustering. Each short sentence t in the set T is segmented with a Chinese word segmentation tool. After segmentation, t is split into words that each carry independent meaning, and every word has a part-of-speech tag. The tagged sentences form a set. Each tagged sentence contains words with no semantic content, such as stop words and conjunctions, and the invention removes these words on the basis of their part-of-speech tags. In this method the invention extracts only nouns and verbs, which carry the trunk and the meaning of the whole sentence. The extracted word combinations s form a set.
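The part-of-speech filter of step 3 can be illustrated with a short sketch. It assumes the segmentation tool has already produced "word/tag" tokens separated by spaces and that noun and verb tags begin with "n" and "v" respectively; the function name extract_trunk and the Chinese sample tokens are assumptions. The sketch covers only the noun and verb filter, not any further pruning.

```python
def extract_trunk(tagged_sentence):
    """Keep only the tokens tagged as nouns or verbs from a 'word/tag' string."""
    kept = []
    for token in tagged_sentence.split():
        word, sep, tag = token.rpartition("/")
        if not sep:
            continue  # token carries no tag, so it cannot be classified
        if tag.startswith(("n", "v")):
            kept.append((word, tag))
    return kept

# 'r' marks a pronoun here and is dropped; nouns and verbs are kept.
print(extract_trunk("身高/n 是/v 多少/r"))
```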
Through the above processing, the invention obtains a set of word fragments. To cluster this set, the invention proposes a model comprising 18 patterns. As shown in Fig. 2, X denotes word A, Y denotes word B, n denotes a noun and v denotes a verb. Each short sentence s in the set is matched against the sentence patterns and assigned to the group corresponding to the one of the 18 patterns it matches. During grouping, sentences with the same pattern and identical or similar content are merged, and their frequencies are accumulated and recorded. To achieve this, similarity is calculated during clustering between any two different s assigned to the same group. The similarity calculation uses the Tongyici Cilin synonym thesaurus.
After clustering, the invention obtains a set of patterns p, each with a corresponding frequency value f.
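The grouping by sentence pattern might look like the sketch below. The 18 patterns of Fig. 2 are not listed in this text, so the sketch simply derives a signature from each short sentence (X for word A, Y for word B, the tag initial for every other word) and groups on that signature; the function names are hypothetical, and the thesaurus-based merging of similar sentences is left out.

```python
from collections import Counter, defaultdict

def pattern_signature(tokens, a, b):
    """Abstract a tagged short sentence: A -> X, B -> Y, other words -> tag initial."""
    parts = []
    for word, tag in tokens:
        if word == a:
            parts.append("X")
        elif word == b:
            parts.append("Y")
        else:
            parts.append(tag[0])  # 'n' for nouns, 'v' for verbs
    return " ".join(parts)

def group_by_pattern(sentences, a, b):
    """Group short sentences by signature and count identical sentences."""
    groups = defaultdict(Counter)
    for tokens in sentences:
        words = tuple(word for word, _ in tokens)
        groups[pattern_signature(tokens, a, b)][words] += 1
    return groups

sentences = [
    [("YaoMing", "n"), ("YeLi", "n"), ("love", "n"), ("fairytale", "n")],
    [("YaoMing", "n"), ("YeLi", "n"), ("love", "n"), ("fairytale", "n")],
    [("YaoMing", "n"), ("wife", "n"), ("YeLi", "n")],
]
groups = group_by_pattern(sentences, "YaoMing", "YeLi")
print(dict(groups))
```

Each inner counter plays the role of the frequency value f attached to a pattern p.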
Step 4: rank the relations. The relation words in the pattern set are extracted. Because the words extracted from different sentence structures repeat heavily, the relation words are deduplicated and their frequencies are accumulated. After deduplication the relation words are ranked by frequency value f, which yields a word set ordered by frequency.
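Deduplication and ranking can be sketched with a counter. The input format (a list of word tuples with their pattern frequencies), the function name rank_relation_words and the alphabetical tie-break are assumptions made for illustration; the frequencies in the sample data are made up.

```python
from collections import Counter

def rank_relation_words(patterns, a, b):
    """patterns: iterable of (words, frequency). Returns [(relation_word, freq)],
    highest frequency first, with ties broken alphabetically."""
    totals = Counter()
    for words, freq in patterns:
        for word in words:
            if word not in (a, b):  # relation words exclude the pair itself
                totals[word] += freq
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

patterns = [
    (("YaoMing", "YeLi", "wedding"), 13),
    (("YaoMing", "wife", "YeLi"), 5),
    (("YaoMing", "YeLi", "wedding", "photo"), 2),
]
ranked = rank_relation_words(patterns, "YaoMing", "YeLi")
print(ranked)
```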
Step 5: secondary clustering by relation similarity. For the ranked set, the invention chooses the relation word ranked first as the center word. For every relation word in the set, its word similarity to the center word is calculated. The words are then grouped a second time according to this similarity, with words of identical similarity assigned to the same group. From each group, the word with the highest frequency is extracted as a candidate word, which gives the relation word set.
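A minimal sketch of the secondary clustering follows. The similarity measure is passed in as a function, since the thesaurus-based measure itself is not reproduced here; the function name secondary_cluster is hypothetical, the input is assumed to be non-empty and already ranked, and the sample similarity table is a lookup used purely for demonstration.

```python
from collections import defaultdict

def secondary_cluster(ranked, similarity):
    """ranked: [(word, freq)] sorted by frequency, highest first.
    similarity: function (word, center) -> float.
    Returns one (word, freq, similarity) candidate per group."""
    center = ranked[0][0]
    groups = defaultdict(list)
    for word, freq in ranked:
        sim = 1.0 if word == center else similarity(word, center)
        groups[sim].append((word, freq))  # identical similarity -> same group
    # One candidate per group: the member with the highest frequency.
    candidates = [max(members, key=lambda wf: wf[1]) + (sim,)
                  for sim, members in groups.items()]
    return sorted(candidates, key=lambda c: -c[2])

sims = {"wedding photo": 0.2172, "photo": 0.2172,
        "daughter": 0.1263, "wife": 0.1263, "sports": 0.3692}
ranked = [("wedding", 13), ("wedding photo", 7), ("daughter", 6),
          ("wife", 5), ("photo", 2), ("sports", 1)]
candidates = secondary_cluster(ranked, lambda word, center: sims[word])
print(candidates)
```

Grouping on exact equality of the similarity value follows the wording of the step; a tolerance could be substituted if the similarity measure produces values that differ only by rounding.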
Step 6: third clustering to obtain the relation words. The set still contains some words that are highly correlated, so the relation words are clustered further by calculating the pairwise correlation between them. If the correlation of two words exceeds a threshold, the words are merged and added to a new relation word set, which gives a more accurate relation word set.
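The third clustering might be sketched as a greedy merge. How merges are resolved when a word is similar to several others is not specified in the text, so the single-pass strategy below (fold each word into the first already-kept word that exceeds the threshold) is an assumption, as are the function name merge_similar and the toy similarity function.

```python
def merge_similar(candidates, similarity, threshold):
    """candidates: [(word, freq)] in rank order.
    Folds each word into the first kept word whose similarity to it
    exceeds the threshold, accumulating the frequency."""
    merged = []  # entries are [representative_word, accumulated_freq]
    for word, freq in candidates:
        for entry in merged:
            if similarity(word, entry[0]) > threshold:
                entry[1] += freq
                break
        else:
            merged.append([word, freq])  # no similar word kept so far
    return [(word, freq) for word, freq in merged]

def toy_similarity(w1, w2):
    # Stand-in measure: words sharing the token 'wedding' count as similar.
    return 0.9 if "wedding" in w1 and "wedding" in w2 else 0.1

result = merge_similar(
    [("wedding", 13), ("wedding feast", 1), ("love", 12), ("daughter", 6)],
    toy_similarity, threshold=0.8)
print(result)
```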
Step 7: obtain the target words. For each relation word in the set, the word pair it forms with the keyword is processed according to steps 1 to 6, and the target word set is finally obtained. One candidate word set is obtained for each relation word, so the final result is a two-dimensional result set.
More specifically, another embodiment of the invention is shown below, which implements the control method of the invention through a concrete example.
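The two-dimensional result set of step 7 can be pictured as a mapping from each relation word to its own list of target candidates. In the sketch below, find_relation_words is a hypothetical callable standing in for steps 1 to 6 applied to one word pair, and the stub answers are placeholders for illustration only.

```python
def analogy_search(keyword, relation_words, find_relation_words):
    """Return {relation_word: target_candidates} for the keyword.

    find_relation_words(pair) stands in for steps 1 to 6 run on a word pair.
    """
    results = {}
    for relation in relation_words:
        pair = (keyword, relation)  # e.g. (Lin Dan, love)
        results[relation] = find_relation_words(pair)
    return results

# Stub pipeline: a lookup table replaces the real retrieval and clustering.
stub_answers = {("Lin Dan", "love"): ["Xie Xingfang"]}
table = analogy_search("Lin Dan", ["love", "wedding"],
                       lambda pair: stub_answers.get(pair, []))
print(table)
```

One dimension of the result is the relation word and the other is the ranked list of candidates found for that relation.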
We take (Yao Ming, Ye Li); (Lin Dan, ?) as an example. Entering (Yao Ming, Ye Li) into the search engine, we obtain complete sentences containing (Yao Ming, Ye Li), for example:
Yao Ming Ye Li love fairy tale_watch online video_Tudou video Yao Ming Ye Li
Yang Lan reveals the secret in her English column: Yao Ming Ye Li baby is daughter (photo) - Qingdao News Network
Yao Ming wife Ye Li: Yao Ming wife Ye Li height is how much / details_Yao Ming wife Ye Li, Ye Lishen ...
After extracting sentences with the template, we obtain:
Yao Ming Ye Li love fairy tale
Yao Ming Ye Li baby is daughter
Yao Ming wife Ye Li height is how much
After word segmentation of the short sentences we obtain:
Yao Ming/n Ye Li/n love/n fairy tale/n
Yao Ming/n Ye Li/n baby/n is/v daughter/n
Yao Ming/n wife/n Ye Li/n height/n is/v how much/r
After trunk extraction we obtain:
Yao Ming/n Ye Li/n love/n fairy tale/n
Yao Ming/n Ye Li/n baby/n is/v
Yao Ming/n wife/n Ye Li/n
The word frequencies counted for the extracted keywords are as follows (to avoid redundancy, only the top ten are listed):
Wedding/n 13
Love/n 12
Wedding photography/n 8
Australia/n 7
Wedding photo/n 7
Hold/v 6
Bat/v 6
Daughter/n 6
Wife/n 5
Hand in hand/v 4
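The result after secondary clustering is as follows (only four groups are listed as examples; each row gives the relation word, its similarity to the center word, and its frequency):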
First group:
Wedding 1.0 13
Second group:
Physical culture 0.36923076923076925 1
Third group:
The good fortune of the whole family 0.21721212121212127 1
Photo 0.21721212121212127 2
Take a group photo 0.21721212121212127 1
Wedding photo 0.21721212121212127 7
The head of a bed 0.21721212121212127 1
Fourth group:
The offspring 0.12631578947368424 1
A thousand pieces of gold 0.12631578947368424 4
Child 0.12631578947368424 1
Wife 0.12631578947368424 1
Daughter 0.12631578947368424 6
Mr. and Mrs 0.12631578947368424 2
Child 0.12631578947368424 4
Wife 0.12631578947368424 5
Baby 0.12631578947368424 3
Finally we obtain the following relation words (each row gives the relation word, its frequency, and its similarity to the center word):
Wedding 13 1.0
Wine drunk at wedding feast 1 0.896
Course 2 0.6153846153846154
Interesting episode 4 0.6000000000000001
Physical culture 1 0.36923076923076925
Inside story 2 0.28571428571428575
Means 3 0.2424242424242425
Wedding photo 7 0.21721212121212127
Advertisement 2 0.18863157894736846
Love 12 0.17142857142857146
Stadium 1 0.1666976744186047
New house 1 0.14933333333333335
Daughter 6 0.12631578947368424
The lover 1 0.12193684210526318
The U.S. 2 0.11162790697674421
Australia 7 0.1116279069767442
Marry 3 0.07407407407407407
Newly-married 2 0.044444444444444446
After the third clustering we obtain (each row gives the relation word and its accumulated frequency):
Wedding 14
Wedding photo 9
Love 12
The lover 1
Marry 3
Newly-married 2
New house 1
Daughter 7
Physical culture 1
Stadium 1
Means 3
Advertisement 2
The U.S. 3
Australia 1
For each of the above relation words, for example (Lin Dan, love), the same process is carried out, and we obtain Xie Xingfang. For an example with a definite relation, accurately matching Xie Xingfang achieves the purpose of the invention.
Further, those skilled in the art will also appreciate that, in another variation, three keywords A, B and C are preferably provided, and the invention can be used to find a target keyword D such that the relation between A and B is approximately equal to the relation between C and D. For example, given the input entries A = Apple, B = iPod and C = Microsoft, the output D is Zune, since the relations of (Apple, iPod) and (Microsoft, Zune) are almost identical. If the relation between the two entities is uniquely determined, we obtain a unique target candidate word or one set of target candidate words. If the two words have multiple relations, one or more target candidate words can be found for each relation, which yields a result set with a two-dimensional structure.
Specific embodiments of the invention have been described above. It should be understood that the invention is not limited to the specific embodiments described above; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims (15)

  1. An analogy retrieval control method based on Chinese word pair relationship similarity, for obtaining at least one target word by retrieval based on at least one keyword, characterized in that it comprises the following steps:
    a. obtaining a word pair, wherein the word pair is a word pair having the same relation as the keyword and the target word;
    b. extracting, from the retrieval result, the short sentences containing the word pair, wherein each short sentence is a complete sentence that contains both words of the word pair;
    c. extracting a word pair relation pattern set from the set of short sentences containing the word pair;
    d. performing a first clustering on the first relation word set in the word pair relation pattern set to obtain a second relation word set;
    e. performing a secondary clustering on the second relation word set, and taking the result of the secondary clustering as a first intermediate relation word set;
    g. forming first word pairs one by one from the relation words in the first intermediate relation word set and the keyword, and repeating steps a to e for each first word pair to obtain the second intermediate relation word set corresponding to that first word pair, wherein a relation word is at least one word in a relation pattern other than the words of the word pair;
    h. taking each second intermediate relation word set as a target word set, wherein each relation word in the second intermediate relation word set corresponds to one target word set, and the fourth relation word set and the second intermediate relation word sets form a two-dimensional result set.
  2. The control method according to claim 1, characterized in that it further comprises, between step e and step g, the step of:
    f. performing a third clustering on the first intermediate relation word set, and taking the result of the third clustering as the first intermediate relation word set,
    wherein in step g, steps a to f are repeated for each first word pair.
  3. The control method according to claim 1 or 2, characterized in that step a comprises the step of:
    a'. retrieving the word pair in a search engine.
  4. The control method according to any one of claims 1 to 3, characterized in that step a comprises the step of:
    a1. extracting the titles entry by entry from the retrieval result of the word pair.
  5. The control method according to any one of claims 1 to 4, characterized in that step c comprises the steps of:
    c1. extracting the relation pattern of each short sentence in the set of short sentences containing the word pair;
    c2. grouping the relation patterns according to a relational model to form the word pair relation pattern set.
  6. The control method according to claim 5, characterized in that step c1 further comprises the steps of:
    c11. splitting each short sentence in the set of short sentences containing the word pair into words having independent meaning;
    c12. tagging each word having independent meaning in each short sentence with its part of speech;
    c13. extracting from each short sentence the words having independent meaning whose part of speech is noun or verb;
    c14. taking the word combination extracted from each short sentence as the relation pattern of that short sentence.
  7. The control method according to claim 5 or 6, characterized in that step c2 further comprises the steps of:
    c21. matching the relation patterns against the relational model, and placing the relation patterns that match the same relational model into one group;
    c22. merging identical relation patterns in each group and accumulating the frequencies of those relation patterns;
    c23. calculating the similarity between different relation patterns in each group;
    c24. merging the relation patterns whose similarity exceeds a first threshold and accumulating the frequencies of those relation patterns;
    c25. taking all relation patterns that have undergone the above merging operations as the word pair relation pattern set, wherein each word pair relation pattern corresponds to one frequency value.
  8. The control method according to any one of claims 1 to 7, characterized in that step d comprises the steps of:
    d1. extracting the first relation word set from the word pair relation pattern set;
    d2. performing the first clustering on the first relation word set to obtain the second relation word set.
  9. The control method according to claim 8, characterized in that step d1 further comprises the steps of:
    d11. extracting the relation words from each word pair relation pattern in the word pair relation pattern set, wherein a relation word is a word in the word pair relation pattern other than the words of the word pair;
    d12. taking all the relation words as the first relation word set, wherein each relation word corresponds to one frequency value, the frequency value being the frequency with which the word pair relation pattern containing the relation word occurs.
  10. The control method according to claim 8 or 9, characterized in that step d2 further comprises the steps of:
    d21. merging identical relation words in the first relation word set and accumulating the frequency values corresponding to those relation words;
    d22. sorting the merged relation words by frequency value;
    d23. taking the sorted relation words as the second relation word set.
  11. The control method according to any one of claims 1 to 10, characterized in that step e comprises the steps of:
    e1. grouping the relation words in the second relation word set;
    e2. taking the relation word with the highest frequency value in each group as a candidate word;
    e3. taking the set of candidate words selected from the groups as the first intermediate relation word set.
  12. The control method according to claim 11, characterized in that step e1 further comprises the steps of:
    e11. taking the relation word with the highest frequency value in the second relation word set as the center word;
    e12. calculating the similarity between the center word and every relation word in the second relation word set other than the center word;
    e13. placing the relation words with identical similarity into one group.
  13. The control method according to any one of claims 2 to 12, characterized in that step f comprises the steps of:
    f1. calculating the pairwise similarity between all relation words in the first intermediate relation word set;
    f2. merging the relation words whose similarity exceeds a second threshold and accumulating the frequency values corresponding to those relation words;
    f3. taking the set of relation words obtained after the above merging as the second intermediate relation word set.
  14. The control method according to any one of claims 1 to 13, characterized in that the following steps are further included before step g:
    i1. judging whether the second intermediate relation word set is the target word set;
    i2. if the second intermediate relation word set is not the target word set, continuing to execute step g.
  15. The control method according to claim 14, characterized in that the following step is further included after step i2:
    i3. if the second intermediate relation word set is the target word set, executing step h.
CN2011104154039A 2011-12-13 2011-12-13 Analogy retrieval control method based on Chinese word pair relationship similarity Pending CN102955837A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011104154039A CN102955837A (en) 2011-12-13 2011-12-13 Analogy retrieval control method based on Chinese word pair relationship similarity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011104154039A CN102955837A (en) 2011-12-13 2011-12-13 Analogy retrieval control method based on Chinese word pair relationship similarity

Publications (1)

Publication Number Publication Date
CN102955837A true CN102955837A (en) 2013-03-06

Family

ID=47764646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011104154039A Pending CN102955837A (en) 2011-12-13 2011-12-13 Analogy retrieval control method based on Chinese word pair relationship similarity

Country Status (1)

Country Link
CN (1) CN102955837A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130306