CN103246697A - Method and equipment for determining near-synonymy sequence clusters - Google Patents

Method and equipment for determining near-synonymy sequence clusters

Info

Publication number
CN103246697A
Authority
CN
China
Prior art keywords
sequence
near-synonymy
cluster
near
synonymy sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310105086XA
Other languages
Chinese (zh)
Other versions
CN103246697B (en)
Inventor
戴帅湘
徐犇
谢毓彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201310105086.XA
Publication of CN103246697A
Application granted
Publication of CN103246697B
Active
Anticipated expiration

Abstract

The invention aims to provide a method and equipment for determining near-synonymy sequence clusters. The method specifically includes acquiring a plurality of near-synonymy sequence pairs; determining initial near-synonymy sequence clusters corresponding to the near-synonymy sequence pairs; and clustering sequences in the initial near-synonymy sequence clusters according to feature vectors of the sequences in the initial near-synonymy sequence clusters to acquire one or a plurality of near-synonymy sequence clusters. Compared with the prior art, the method and the equipment have the advantages that the initial near-synonymy sequence clusters corresponding to the near-synonymy sequence pairs are determined, the sequences in the initial near-synonymy sequence clusters are clustered according to the feature vectors of the sequences in the initial near-synonymy sequence clusters to acquire the near-synonymy sequence clusters, accordingly, the near-synonymy sequence clusters can be accurately determined, the information acquisition efficiency is improved for users, and the search experience is enhanced for the users.

Description

A method and equipment for determining near-synonymy sequence clusters
Technical field
The present invention relates to the field of Internet technology, and in particular to a technique for determining near-synonymy sequence clusters.
Background art
Currently, with the development of Internet technology and the penetration of Internet applications into users' study, work, and life, people increasingly obtain information over the network, for example by entering a query sequence into a search engine, which returns search results matching that query sequence. However, when different users search for the same content, the query sequences they enter are not exactly the same; for example, they may be worded differently while having the same meaning. When an existing search engine performs matching queries on such query sequences, it does not take the near-synonymy relation between them into account, so the search results it returns are not exactly the same either. This impairs the efficiency with which users obtain information and the accuracy of that information.
Summary of the invention
The purpose of the present invention is to provide a method and equipment for determining near-synonymy sequence clusters.
According to one aspect of the present invention, a method for determining near-synonymy sequence clusters is provided, wherein the method comprises the following steps:
a. acquiring a plurality of near-synonymy sequence pairs;
b. determining the initial near-synonymy sequence clusters corresponding to the plurality of near-synonymy sequence pairs;
c. clustering the sequences in the initial near-synonymy sequence clusters according to the feature vectors of the sequences in the initial near-synonymy sequence clusters, to obtain one or more near-synonymy sequence clusters.
According to another aspect of the present invention, determination equipment for determining near-synonymy sequence clusters is also provided, wherein the determination equipment comprises:
an acquisition device, configured to acquire a plurality of near-synonymy sequence pairs;
an initial determination device, configured to determine the initial near-synonymy sequence clusters corresponding to the plurality of near-synonymy sequence pairs;
a sequence cluster acquisition device, configured to cluster the sequences in the initial near-synonymy sequence clusters according to the feature vectors of the sequences in the initial near-synonymy sequence clusters, to obtain one or more near-synonymy sequence clusters.
According to a further aspect of the present invention, a search engine for determining near-synonymy sequence clusters is also provided, wherein the search engine comprises the determination equipment for determining near-synonymy sequence clusters described above according to one aspect of the present invention.
According to yet another aspect of the present invention, a search engine plug-in for determining near-synonymy sequence clusters is also provided, wherein the search engine plug-in comprises the determination equipment for determining near-synonymy sequence clusters described above according to one aspect of the present invention.
Compared with the prior art, the present invention determines the initial near-synonymy sequence clusters corresponding to a plurality of near-synonymy sequence pairs and clusters the sequences in those initial clusters according to their feature vectors, obtaining one or more near-synonymy sequence clusters. Near-synonymy sequence clusters can thus be determined more accurately, which improves both the efficiency with which users obtain information and the accuracy of that information, and enhances the users' search experience. Moreover, the present invention can also establish or update a near-synonymy sequence library according to the near-synonymy sequence clusters, and detect whether a sequence in a near-synonymy sequence cluster is also present in other near-synonymy sequence clusters in the library; if so, the sequence is de-duplicated in order to update the library. This determines near-synonymy sequence clusters more accurately, further improves the efficiency with which users obtain information, and enhances the users' search experience. In addition, the present invention can also establish or update the near-synonymy sequence library according to the near-synonymy sequence clusters and the group of preferred search results corresponding to each of them. A matching query can then be performed in the library to obtain the target near-synonymy sequence cluster corresponding to a query sequence, and at least one of the preferred search results corresponding to that target cluster can be provided to the user, which further improves the efficiency with which users obtain information and enhances the users' search experience.
Description of drawings
Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, given with reference to the accompanying drawings:
Fig. 1 is a schematic diagram of equipment for determining near-synonymy sequence clusters according to one aspect of the present invention;
Fig. 2 is a schematic diagram of sequence connections for determining near-synonymy sequence clusters according to one aspect of the present invention;
Fig. 3 is a schematic diagram of the updated sequence connections, corresponding to Fig. 2, for determining near-synonymy sequence clusters according to one aspect of the present invention;
Fig. 4 is a schematic diagram of the merging, corresponding to Fig. 3, after dense sequence clusters are regarded as node vertices, for determining near-synonymy sequence clusters according to one aspect of the present invention;
Fig. 5 is a schematic diagram of node merging after dense sequence clusters are regarded as vertices, for determining near-synonymy sequence clusters according to one aspect of the present invention;
Fig. 6 is a schematic diagram of the node set after the node merging corresponding to Fig. 5, for determining near-synonymy sequence clusters according to one aspect of the present invention;
Fig. 7 is a schematic diagram of equipment for determining near-synonymy sequence clusters according to a preferred embodiment of the present invention;
Fig. 8 is a flowchart of a method for determining near-synonymy sequence clusters according to another aspect of the present invention;
Fig. 9 is a flowchart of a method for determining near-synonymy sequence clusters according to a preferred embodiment of the present invention.
The same or similar reference numerals in the drawings denote the same or similar components.
Embodiment
The present invention is described in further detail below with reference to the accompanying drawings.
Fig. 1 shows determination equipment 1 for determining near-synonymy sequence clusters according to one aspect of the present invention, wherein determination equipment 1 comprises acquisition device 11, initial determination device 12, and sequence cluster acquisition device 13. Specifically, acquisition device 11 acquires a plurality of near-synonymy sequence pairs; initial determination device 12 determines the initial near-synonymy sequence clusters corresponding to the plurality of near-synonymy sequence pairs; sequence cluster acquisition device 13 clusters the sequences in the initial near-synonymy sequence clusters according to the feature vectors of those sequences, to obtain one or more near-synonymy sequence clusters. Here, determination equipment 1 includes, but is not limited to, network equipment, user equipment, or equipment formed by integrating network equipment and user equipment over a network. The network equipment includes, but is not limited to, an implementation such as a network host, a single network server, a set of multiple network servers, or a cloud-computing-based set of computers; alternatively, it may be implemented by user equipment. Here, the cloud consists of a large number of hosts or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers. The user equipment may be any electronic product capable of human-computer interaction with the user via a keyboard, mouse, touchpad, touchscreen, handwriting device, or the like, for example a computer, mobile phone, PDA, pocket PC (PPC), or tablet computer. The network includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a VPN, a wireless ad hoc network, and so on. Those skilled in the art will understand that the above determination equipment 1 is only an example; other existing or future network equipment or user equipment, if applicable to the present invention, should also fall within the scope of protection of the present invention and is incorporated herein by reference. Here, both the network equipment and the user equipment include an electronic device that can automatically perform numerical computation and information processing according to instructions set or stored in advance, whose hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and so on.
Specifically, acquisition device 11 acquires a plurality of search logs and then performs semantic analysis on them to obtain a plurality of near-synonymy sequence pairs. Here, the near-synonymy sequence pairs include, but are not limited to, at least any of the following: 1) pairs of synonymous query sequences that are worded differently but have the same meaning, such as "Expert English language training by qualified teachers" and "English training"; 2) pairs of near-synonymous query sequences with similar meanings, such as "Expert English language training by qualified teachers" and "foreign language training". Those skilled in the art will understand that the above related query sequences are only examples; other existing or future near-synonymy sequences, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference. Here, the ways in which acquisition device 11 acquires the plurality of near-synonymy sequence pairs include, but are not limited to, at least any of the following:
1) Acquisition device 11 first acquires a plurality of search logs through an application programming interface (API) for acquiring search logs provided by third-party equipment such as a search engine or browser; it then performs semantic analysis on these search logs to obtain a plurality of near-synonymy sequence pairs. For example, through the search-log API provided by a search engine, acquisition device 11 acquires a plurality of search logs that record, for a given period, which keywords the users' submitted searches contained, which of the returned search results the users clicked, and so on. Acquisition device 11 then performs semantic analysis on the query sequences in these search logs to obtain a plurality of near-synonymy sequence pairs, such as synonym sequence pairs made up of keywords that are synonyms or near-synonyms of the keyword "Expert English language training by qualified teachers", for example "Expert English language training by qualified teachers", "English training", "Expert English language training by qualified teachers", and "education on foreign language".
2) Acquisition device 11 first acquires a plurality of search logs through an application programming interface (API) for acquiring search logs provided by third-party equipment such as a search engine or browser; it then obtains one or more search records from the search logs, wherein each search record comprises a query sequence and its corresponding search results; finally, it obtains a plurality of near-synonymy sequence pairs according to the one or more search records. Here, the ways in which acquisition device 11 obtains a plurality of near-synonymy sequence pairs according to the search records include, but are not limited to: i) classifying the one or more search records according to their corresponding search results, by performing semantic analysis on those search results (for example, on the summary text, title link text, or page body content of each search result), to obtain the plurality of near-synonymy sequence pairs, wherein the near-synonymy sequence pairs comprise query sequences of search records belonging to the same class. For example, suppose acquisition device 11 acquires a plurality of query sequences recorded in the search logs, and that the query sequences and their corresponding search results are the following search records I to VII:
I "Expert English language training by qualified teachers":
"EF Englishtown official website, global distinguished Expert English language training by qualified teachers expert"
"Expert English language training by qualified teachers-Wei Bo English allows study English and becomes so simple!"
II "Expert English language training by qualified teachers":
"the hot luxurious most solemn of ceremonies on Christmas is namely enjoyed in the Expert English language training by qualified teachers registration"
"EF Englishtown official website, global distinguished Expert English language training by qualified teachers expert"
"New Orient IELTS training"
III "English training":
"Beijing Expert English language training by qualified teachers Wei Bo English-we are absorbed in adult's Expert English language training by qualified teachers! (official website)"
"the Expert English language training by qualified teachers Beijing IELTS training of New Orient, Beijing is entrusted training Beijing to prepare for the postgraduate qualifying examination to train and is gone abroad ..."
IV "fresh flower":
"3 hours at first Chinese fresh flower nets of fresh flower of fresh flower!"
"warm fresh flower net fresh flower"
V "fresh flower express delivery":
"fresh flower, I only choose state's fresh flower express delivery net! 100% quality guarantee"
"send and take fresh flower express delivery fresh flower net everyday"
VI "dangerous forest thoughts":
"piggy diary: 'dangerous forest' thoughts-taste conversation-literature and art-Sohu's circle"
"[new information] reads 'dangerous forest' thoughts-lovely piggy-Sohu's blog"
"'dangerous forest'-reaction to an article-NetCash chess/card game is downloaded"
VII "dangerous forest thoughts":
"piggy diary: 'dangerous forest' thoughts-taste conversation-literature and art-Sohu's circle"
"yellow quiet firm five (5) _ Baidu libraries of dangerous forest reaction to an article"
"[new information] reads 'dangerous forest' thoughts-lovely piggy-Sohu's blog"
"'dangerous forest'-reaction to an article-NetCash chess/card game is downloaded"
Acquisition device 11 then performs semantic analysis on the search results corresponding to search records I to VII, for example on their title link texts, and classifies the search records accordingly, obtaining the following classes: (1) search records I to III are related and are placed in one class; (2) search records IV and V are related and are placed in another class; (3) search records VI and VII are related and are placed in a third class. Then, according to this classification, acquisition device 11 takes the query sequences of search records belonging to the same class as near-synonymy sequence pairs. The pairs corresponding to search records I to III are pairs1, "Expert English language training by qualified teachers" and "English training", and pairs2, "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers"; the pair corresponding to search records IV and V is pairs3, "fresh flower" and "fresh flower express delivery"; and the pair corresponding to search records VI and VII is pairs4, "dangerous forest thoughts" and "dangerous forest thoughts".
ii) Classifying the query sequences of the one or more search records to obtain the plurality of near-synonymy sequence pairs, wherein the near-synonymy sequence pairs comprise query sequences of search records belonging to the same class. For example, continuing the example above, acquisition device 11 classifies, by semantic analysis, the query sequences of the search records I to VII it has acquired, obtaining one or more synonym sequence clusters. Acquisition device 11 thereby obtains the near-synonymy sequence pairs corresponding to search records I to III, namely pairs1, "Expert English language training by qualified teachers" and "English training", and pairs2, "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers"; the pair corresponding to search records IV and V, namely pairs3, "fresh flower" and "fresh flower express delivery"; and the pair corresponding to search records VI and VII, namely pairs4, "dangerous forest thoughts" and "dangerous forest thoughts".
iii) Taking a query sequence in a search record and its corresponding search result as a near-synonymy sequence pair. For example, suppose acquisition device 11 learns from the search logs that user A clicked the search result "Beijing Expert English language training by qualified teachers Wei Bo English-we are absorbed in adult's Expert English language training by qualified teachers! (official website)" among the search results matching the query sequence "Expert English language training by qualified teachers". Acquisition device 11 then takes the title of that search result as a sequence and forms a near-synonymy sequence pair from it and the sequence "Expert English language training by qualified teachers".
iv) Taking different query sequences that correspond to the same search result in the search records as a near-synonymy sequence pair. For example, suppose acquisition device 11 learns from the search logs that user A clicked the search result "Beijing Expert English language training by qualified teachers Wei Bo English-we are absorbed in adult's Expert English language training by qualified teachers! (official website)" among the search results matching the query sequence "Expert English language training by qualified teachers", and that user B, searching with the query sequence "foreign language training", clicked the same search result among the search results matching "foreign language training". Acquisition device 11 then forms a near-synonymy sequence pair from the sequences "Expert English language training by qualified teachers" and "foreign language training".
3) Screening a plurality of candidate near-synonymy sequence pairs that have been labeled with a near-synonymy relation, according to the relevance information between the two sequences of each candidate pair, to obtain the near-synonymy sequence pairs. Specifically, acquisition device 11 first determines the relevance information between the two sequences of each candidate near-synonymy sequence pair, for example from the degree of text matching between the two sequences, or from association frequency information for the two sequences in the search logs, such as the number of times the two sequences appear for the same search result. Acquisition device 11 then screens the candidate pairs according to this relevance information, for example by deleting candidate pairs whose relevance information is below a predetermined threshold, to obtain the near-synonymy sequence pairs. For example, suppose acquisition device 11 acquires the following candidate near-synonymy sequence pairs labeled with a near-synonymy relation:
pairs1 "Expert English language training by qualified teachers" and "English training"
pairs2 "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers"
pairs3 "fresh flower" and "fresh flower express delivery"
pairs4 "dangerous forest thoughts" and "dangerous forest thoughts"
Acquisition device 11 first determines the relevance information between the two sequences of each candidate pair, for example from the degree of text matching between the two sequences, obtaining relevance values of 0.75, 1, 0.5, and 1 for the four candidate pairs above. Acquisition device 11 then screens the candidate pairs according to this relevance information, for example by deleting the candidate pair pairs3, whose relevance is below a predetermined threshold of 0.7. The resulting near-synonymy sequence pairs are pairs1, "Expert English language training by qualified teachers" and "English training"; pairs2, "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers"; and pairs4, "dangerous forest thoughts" and "dangerous forest thoughts".
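The screening step in way 3) can be sketched in a few lines of Python. This is a minimal illustration only: it takes the relevance values 0.75, 1, 0.5, and 1 and the threshold 0.7 from the example above as given, and the function name `screen_pairs` and the data layout are assumptions made for illustration, not part of the patent.

```python
def screen_pairs(candidates, threshold=0.7):
    """Keep candidate pairs whose relevance is not below the threshold.

    candidates: list of (pair_name, relevance) tuples (hypothetical layout).
    """
    return [name for name, relevance in candidates if relevance >= threshold]


# Relevance values taken from the worked example in the text.
candidates = [("pairs1", 0.75), ("pairs2", 1.0), ("pairs3", 0.5), ("pairs4", 1.0)]
kept = screen_pairs(candidates)
print(kept)
```

With these inputs only pairs3 falls below the threshold and is dropped, which matches the outcome described in the example.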
Preferably, acquisition device 11 may first obtain a plurality of sequence-result pairs according to a plurality of search logs, and then, according to the association information between the search results in those sequence-result pairs, screen out a plurality of near-synonymy sequence pairs from the sequences they contain. For example, through the search-log API provided by third-party equipment such as a search engine or browser, acquisition device 11 acquires the sequence-result pairs recorded in the search logs, such as search records I to VII above. Acquisition device 11 then performs semantic analysis on the search results of search records I to VII, for example on their title link texts: by counting how often identical or similar text appears in the title link texts, it determines the relevance between the search results of search records I to VII and hence the association information between them, obtaining the following classification: (1) search records I to III are related and are placed in one class; (2) search records IV and V are related and are placed in another class; (3) search records VI and VII are related and are placed in a third class. Then, according to this classification, acquisition device 11 screens out a plurality of near-synonymy sequence pairs from the sequences contained in the sequence-result pairs, for example by taking the query sequences corresponding to search results of the same class as near-synonymy sequence pairs. The pairs corresponding to search records I to III are pairs1, "Expert English language training by qualified teachers" and "English training", and pairs2, "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers"; the pair corresponding to search records IV and V is pairs3, "fresh flower" and "fresh flower express delivery"; and the pair corresponding to search records VI and VII is pairs4, "dangerous forest thoughts" and "dangerous forest thoughts".
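The idea of relating search records by how often the same title text appears in their results can be sketched as follows. This is a simplified Python illustration under stated assumptions: it counts only exactly identical titles (the text also allows similar text, which would need a similarity measure), and the function name `related_query_pairs`, the `min_shared` parameter, and the placeholder records are hypothetical.

```python
from itertools import combinations


def related_query_pairs(records, min_shared=1):
    """Pair queries whose result lists share at least `min_shared` titles.

    records: list of (query, [result_title, ...]) tuples (hypothetical layout).
    Returns (query_a, query_b, shared_title_count) tuples.
    """
    pairs = []
    for (q1, titles1), (q2, titles2) in combinations(records, 2):
        shared = len(set(titles1) & set(titles2))  # identical titles only
        if shared >= min_shared:
            pairs.append((q1, q2, shared))
    return pairs


# Placeholder records; real input would come from the search logs.
records = [
    ("query-a", ["title-1", "title-2"]),
    ("query-b", ["title-2", "title-3"]),
    ("query-c", ["title-4"]),
]
print(related_query_pairs(records))
```

Here "query-a" and "query-b" share one title and are paired, while "query-c" shares none and stays unpaired.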
Those skilled in the art will understand that the above ways of acquiring a plurality of near-synonymy sequence pairs are only examples; other existing or future ways of acquiring a plurality of near-synonymy sequence pairs, if applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
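Among the acquisition ways described above, the one that pairs different query sequences clicked through to the same search result lends itself to a short Python sketch. It assumes the search logs have already been reduced to (query, clicked result) tuples; the function name `co_click_pairs` and the sample log entries are hypothetical and shortened for readability.

```python
from collections import defaultdict
from itertools import combinations


def co_click_pairs(click_log):
    """Pair distinct queries whose users clicked the same search result.

    click_log: iterable of (query, clicked_result) tuples (hypothetical layout).
    Returns a sorted list of (query_a, query_b) tuples.
    """
    queries_by_result = defaultdict(set)
    for query, result in click_log:
        queries_by_result[result].add(query)

    pairs = set()
    for queries in queries_by_result.values():
        # Every two distinct queries that led to the same click form a pair.
        for a, b in combinations(sorted(queries), 2):
            pairs.add((a, b))
    return sorted(pairs)


# Hypothetical log: two users reach "result-1" from different queries.
log = [
    ("English training", "result-1"),
    ("foreign language training", "result-1"),
    ("fresh flower", "result-2"),
]
print(co_click_pairs(log))
```

A result clicked from only one query produces no pair, so "fresh flower" does not appear in the output.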
Initial determination device 12 determines the initial near-synonymy sequence clusters corresponding to the plurality of near-synonymy sequence pairs according to the association information between those pairs, such as the semantic features of the near-synonymy sequences, for example by merging near-synonymy sequences whose semantic features are the same or similar to obtain the initial near-synonymy sequence clusters. For example, continuing the example above, initial determination device 12 performs semantic analysis on the sequences in the near-synonymy sequence pairs obtained by acquisition device 11 and finds that the sequences of pairs1 and pairs2 are semantically the same or similar. Initial determination device 12 then merges the sequences of pairs1 and pairs2 to obtain the initial near-synonymy sequence cluster cluster1, which comprises "Expert English language training by qualified teachers", "Expert English language training by qualified teachers", and "English training". In the same way, initial determination device 12 can also obtain the initial near-synonymy sequence cluster cluster2, which comprises "fresh flower" and "fresh flower express delivery", and the initial near-synonymy sequence cluster cluster3, which comprises "dangerous forest thoughts" and "dangerous forest thoughts".
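One simple way to realize the merging of pairs into initial clusters is a union-find over the sequences, merging any pairs that share a sequence. This is an assumption for illustration: the text merges pairs whose sequences are semantically the same or similar, and shared membership is used here only as a stand-in for that semantic test. The function name `initial_clusters` and the placeholder sequence labels are hypothetical.

```python
def initial_clusters(pairs):
    """Merge sequence pairs that share a member into initial clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two groups

    clusters = {}
    for seq in list(parent):
        clusters.setdefault(find(seq), set()).add(seq)
    return sorted(sorted(members) for members in clusters.values())


# Placeholder labels: q1-q3 stand for training queries, f1-f2 for flower
# queries, d1 for a pair whose two sequences are identical.
pairs = [("q1", "q2"), ("q1", "q3"), ("f1", "f2"), ("d1", "d1")]
print(initial_clusters(pairs))
```

The two pairs that share "q1" collapse into one cluster, mirroring how pairs1 and pairs2 are merged into cluster1 in the example; a pair of identical sequences yields a single-member cluster.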
Those skilled in the art will understand that the above manner of determining the initial near-synonymy sequence clusters is merely an example; other existing or future manners of determining the initial near-synonymy sequence clusters, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
Sequence cluster acquisition device 13 clusters the sequences in the initial near-synonymy sequence cluster according to the feature vectors of those sequences, to obtain one or more near-synonymy sequence clusters. Specifically, sequence cluster acquisition device 13 first determines the feature vector of each sequence in the initial near-synonymy sequence cluster; then, according to the feature vectors, it clusters the sequences in the initial near-synonymy sequence cluster to obtain one or more near-synonymy sequence clusters. Here, the feature vector includes, but is not limited to, at least any one of the following characteristic components:
1. X characteristic component: the vector formed by the semantic feature information obtained from the sequence after word segmentation, for example the vector formed by the bag of words obtained after segmenting the sequence. For the sequence query1 "Expert English language training by qualified teachers", segmentation yields "Expert English language training by qualified teachers", and the corresponding vector can be expressed as {x1: English, x2: training}, where the coefficient of component xi is its TFIDF (term frequency-inverse document frequency) value. As another example, for the sequence query2 "ask way, egg menu, the homely egg of egg how to do, menu complete works", segmentation yields "asking the way egg menu daily life of a family egg of egg how to do the menu complete works"; after stop words, grammatical particles and the like are removed, the corresponding vector can be expressed as {x1: egg, x2: way, x3: menu, x4: the daily life of a family, x5: complete works}, where the coefficient of component xi is its TFIDF value. Taking the word "egg" as an example: the DF value can be approximated statistically over a large number of web pages (say N pages). If the word "egg" appears in 10000 web pages, its DF value is 10000; if "egg" occurs 3 times in the bag of words obtained after segmentation, its term frequency in that bag, i.e. its TF value, is 3/11; the TFIDF value of "egg" is therefore (3/11) * log(N/10000).
2. Y characteristic component: the vector formed by the bag of words obtained by segmenting the titles and/or summary information of the top N search results corresponding to the sequence. Here, the coefficients of the Y characteristic component can include the historical total click information, the average click information, etc. of the search results corresponding to the sequence. The vector of the Y characteristic component is determined in the same or a similar way as the vector of the X characteristic component; for brevity this is not repeated here and is incorporated by reference.
3. Z characteristic component: the vector formed by the historical click information of users on the search results corresponding to the sequence. Here, the coefficients of the Z characteristic component can include the historical total click information, the average click information, etc. of the search results corresponding to the sequence. For example, if the search log records that, for query1, users clicked the corresponding search results url11, url12 and url13 3 times, 4 times and 1 time respectively, then the vector {url1, url2, url3} can be used to represent query1.
Here, the feature vector is formed in ways including, but not limited to, at least any of the following: 1) it is formed directly from the characteristic components; 2) it is obtained by weighting the characteristic components according to their corresponding weight information. Those skilled in the art will understand that the above feature vector and characteristic components are merely examples; other existing or future feature vectors or characteristic components, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference. Here, the ways in which sequence cluster acquisition device 13 determines the feature vectors of the sequences in the initial near-synonymy sequence cluster include, but are not limited to, at least any of the following:
1) According to the preset characteristic components, the feature vector is formed directly from the characteristic components; the feature vector can be expressed as $\vec{T} = \vec{X} + \vec{Y} + \vec{Z}$. Suppose the sequence "Expert English language training by qualified teachers" in the initial near-synonymy sequence cluster cluster1 determined by initial determining device 12 yields "Expert English language training by qualified teachers" after word segmentation. Then characteristic component $\vec{X}$ can be expressed as {x1: English, x2: training}; if the TFIDF values of x1 and x2 are 0.9 and 0.9 respectively, then $\vec{X} = 0.9\,\vec{x}_1 + 0.9\,\vec{x}_2$. For characteristic component $\vec{Y}$, suppose that in the search logs of the last 200 days the search result url1 "EF Englishtown official website; global distinguished Expert English language training by qualified teachers expert" has the highest total number of clicks for the sequence "Expert English language training by qualified teachers", for example 10,000 clicks. Segmenting this title yields "the distinguished Expert English language training by qualified teachers expert in the whole world, EF Englishtown official website"; after stop words, grammatical particles and the like are removed, characteristic component $\vec{Y}$ can be expressed as {y1: EF, y2: education, y3: English, y4: training, y5: expert}. If the TFIDF values of y1, y2, y3, y4 and y5 are 0.7, 0.77, 0.9, 0.9 and 0.3 respectively, then $\vec{Y} = 0.7\,\vec{y}_1 + 0.77\,\vec{y}_2 + 0.9\,\vec{y}_3 + 0.9\,\vec{y}_4 + 0.3\,\vec{y}_5$. For characteristic component $\vec{Z}$, if in the search logs of the last 200 days the search results url1 "EF Englishtown official website, global distinguished Expert English language training by qualified teachers expert" and url2 "Expert English language training by qualified teachers-Wei Bo English, making learning English so simple!" of the sequence "Expert English language training by qualified teachers" were clicked 4 times and 3 times respectively, then $\vec{Z} = 4\,\overrightarrow{url}_1 + 3\,\overrightarrow{url}_2$. Sequence cluster acquisition device 13 then determines the feature vector of the sequence "Expert English language training by qualified teachers" to be $\vec{T} = (0.9\,\vec{x}_1 + 0.9\,\vec{x}_2) + (0.7\,\vec{y}_1 + 0.77\,\vec{y}_2 + 0.9\,\vec{y}_3 + 0.9\,\vec{y}_4 + 0.3\,\vec{y}_5) + (4\,\overrightarrow{url}_1 + 3\,\overrightarrow{url}_2)$.
2) According to the preset characteristic components, the feature vector is determined by weighting, based on the weight information corresponding to the characteristic components. For example, continuing the example above, suppose the weights corresponding to characteristic components $\vec{X}$ and $\vec{Y}$ are 0.4 and 0.2 respectively; sequence cluster acquisition device 13 then determines the feature vector of the sequence "Expert English language training by qualified teachers" to be $\vec{T} = 0.4\,(0.9\,\vec{x}_1 + 0.9\,\vec{x}_2) + 0.2\,(0.7\,\vec{y}_1 + 0.77\,\vec{y}_2 + 0.9\,\vec{y}_3 + 0.9\,\vec{y}_4 + 0.3\,\vec{y}_5) + (4\,\overrightarrow{url}_1 + 3\,\overrightarrow{url}_2)$.
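The two ways of forming the vector for a sequence, direct combination and weighted combination, can be sketched as follows. This is a minimal illustration only: representing each characteristic component as a Python dict keyed by dimension name, and the helper names `scale` and `combine`, are assumptions made for the sketch and do not come from the source.

```python
from collections import defaultdict

def scale(vec, weight):
    """Multiply every coefficient of a sparse vector by a weight."""
    return {key: weight * coef for key, coef in vec.items()}

def combine(*components):
    """Add sparse vectors; each dimension name belongs to one component."""
    total = defaultdict(float)
    for comp in components:
        for key, coef in comp.items():
            total[key] += coef
    return dict(total)

# Coefficients taken from the worked example above.
X = {"x1": 0.9, "x2": 0.9}
Y = {"y1": 0.7, "y2": 0.77, "y3": 0.9, "y4": 0.9, "y5": 0.3}
Z = {"url1": 4, "url2": 3}

# Way 1): the components are combined directly.
T_direct = combine(X, Y, Z)

# Way 2): X and Y are weighted by 0.4 and 0.2; Z is left unweighted,
# as in the example formula.
T_weighted = combine(scale(X, 0.4), scale(Y, 0.2), Z)
```

Keeping the dimensions of X, Y and Z in separate key namespaces means the sum never mixes coefficients from different components, which is what lets the later similarity step compare component by component.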
Those skilled in the art will understand that the above manner of determining the feature vectors of the sequences in the initial near-synonymy sequence cluster is merely an example; other existing or future manners of determining the feature vectors of the sequences in the initial near-synonymy sequence cluster, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
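The TFIDF coefficient used for the X characteristic component, (3/11) * log(N/10000) for the word "egg", can be reproduced with a short sketch. The function name `tf_idf`, the placeholder tokens, the corpus size N and the use of the natural logarithm are all assumptions; the source does not state N or the base of the logarithm.

```python
import math

def tf_idf(term, bag, doc_freq, n_docs):
    """TF is the share of the bag taken by the term; IDF is log(n_docs / doc_freq)."""
    tf = bag.count(term) / len(bag)
    return tf * math.log(n_docs / doc_freq)

# Hypothetical 11-token bag in which "egg" occurs 3 times, as in the example.
bag = ["egg"] * 3 + ["other%d" % i for i in range(8)]

N = 1_000_000  # assumed number of web pages; the source leaves N open
value = tf_idf("egg", bag, doc_freq=10000, n_docs=N)
```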
Then, sequence cluster acquisition device 13 clusters the sequences in the initial near-synonymy sequence cluster according to the feature vectors, to obtain one or more near-synonymy sequence clusters. Specifically, sequence cluster acquisition device 13 may first determine the cosine of the included angle between each pair of corresponding characteristic components of the feature vectors of the sequences in the initial near-synonymy sequence cluster; then, from these per-component cosine values combined with the weight information of each characteristic component, it determines by weighting the cosine value between the feature vectors of the sequences, and thus the similarity of the sequences in the initial near-synonymy sequence cluster; finally, it clusters the sequences in the initial near-synonymy sequence cluster according to this similarity, to obtain one or more near-synonymy sequence clusters. For example, suppose sequence cluster acquisition device 13 determines that the feature vectors of the sequences "Expert English language training by qualified teachers", "Expert English language training by qualified teachers" and "English training" in the initial near-synonymy sequence cluster cluster1 are respectively $\vec{T}_1 = \vec{X}_1 + \vec{Y}_1 + \vec{Z}_1$, $\vec{T}_2 = \vec{X}_2 + \vec{Y}_2 + \vec{Z}_2$ and $\vec{T}_3 = \vec{X}_3 + \vec{Y}_3 + \vec{Z}_3$. Sequence cluster acquisition device 13 first computes the cosine of the included angle between each pair of corresponding characteristic components; for $\vec{T}_1$ and $\vec{T}_2$, for example: for the X characteristic component, $sim_1 = \cos(\vec{X}_1, \vec{X}_2) = \frac{\vec{X}_1 \cdot \vec{X}_2}{|\vec{X}_1|\,|\vec{X}_2|}$; for the Y characteristic component, $sim_2 = \cos(\vec{Y}_1, \vec{Y}_2) = \frac{\vec{Y}_1 \cdot \vec{Y}_2}{|\vec{Y}_1|\,|\vec{Y}_2|}$; for the Z characteristic component, $sim_3 = \cos(\vec{Z}_1, \vec{Z}_2) = \frac{\vec{Z}_1 \cdot \vec{Z}_2}{|\vec{Z}_1|\,|\vec{Z}_2|}$. Sequence cluster acquisition device 13 can then obtain the similarity between $\vec{T}_1$ and $\vec{T}_2$ as $similarity(\vec{T}_1, \vec{T}_2) = a \cdot sim_1 + b \cdot sim_2 + c \cdot sim_3$, where a, b and c are the weights of the corresponding characteristic components and satisfy a+b+c=1. Here, the values of a, b and c can be determined by machine learning or can be predetermined values. If a=0.5, b=0.3 and c=0.2, and the cosine values are $sim_1 = 0.9$, $sim_2 = 0.9$ and $sim_3 = 0.6$, then sequence cluster acquisition device 13 calculates the similarity between $\vec{T}_1$ and $\vec{T}_2$ as $similarity(\vec{T}_1, \vec{T}_2) = a \cdot sim_1 + b \cdot sim_2 + c \cdot sim_3 = 0.5 \times 0.9 + 0.3 \times 0.9 + 0.2 \times 0.6 = 0.84$, which is greater than the predetermined threshold, such as 0.8. Sequence cluster acquisition device 13 therefore assigns the sequences "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers", which correspond to feature vectors $\vec{T}_1$ and $\vec{T}_2$ respectively, to the same near-synonymy sequence cluster, such as synonyms-cluster1. Similarly, sequence cluster acquisition device 13 can calculate a similarity involving $\vec{T}_3$ that is less than the predetermined threshold, such as 0.8, and it then assigns the sequence "English training" corresponding to feature vector $\vec{T}_3$ to another near-synonymy sequence cluster, such as synonyms-cluster2.
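The per-component cosine and the weighted similarity described above can be sketched as follows. The dict-based sparse vectors and the function names `cosine`, `weighted_similarity` and `combine` are assumptions for illustration; the weights and cosine values in the last lines are the ones from the worked example.

```python
import math

def cosine(u, v):
    """Cosine of the included angle between two sparse vectors (dicts)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def weighted_similarity(t1, t2, weights):
    """t1 and t2 map a component name ("X", "Y", "Z") to a sparse vector;
    weights maps the same names to the weights a, b, c."""
    return sum(w * cosine(t1[name], t2[name]) for name, w in weights.items())

def combine(sims, weights):
    """Weighted sum of already computed per-component cosine values."""
    return sum(w * s for w, s in zip(weights, sims))

# Arithmetic of the worked example: a=0.5, b=0.3, c=0.2.
score = combine([0.9, 0.9, 0.6], [0.5, 0.3, 0.2])   # approximately 0.84
same_cluster = score > 0.8                          # predetermined threshold
```

Computing the cosine per component before weighting, rather than on the concatenated vector, keeps components with large raw coefficients (such as click counts) from dominating the score.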
Those skilled in the art will understand that the above manner of determining the weight information of each characteristic component is merely an example; other existing or future manners of determining the weight information of each characteristic component, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
Preferably, sequence cluster acquisition device 13 may first determine the similarity information between the sequences in the initial near-synonymy sequence cluster according to the feature vectors of those sequences; then, according to the similarity information, it clusters the sequences in the initial near-synonymy sequence cluster to obtain one or more near-synonymy sequence clusters; wherein the feature vector comprises at least any one of the following characteristic components:
- the sequence semantic feature information corresponding to the sequence;
- the historical click information of the search results corresponding to the sequence;
- the summary information of the search results corresponding to the sequence.
For example, suppose sequence cluster acquisition device 13 determines that the feature vectors of the sequences "Expert English language training by qualified teachers", "Expert English language training by qualified teachers" and "English training" in the initial near-synonymy sequence cluster cluster1 are respectively $\vec{T}_1 = \vec{X}_1 + \vec{Y}_1$, $\vec{T}_2 = \vec{X}_2 + \vec{Y}_2$ and $\vec{T}_3 = \vec{X}_3 + \vec{Y}_3$. Sequence cluster acquisition device 13 first computes, from the vectors $\vec{T}_1$, $\vec{T}_2$ and $\vec{T}_3$, the cosine of the included angle between each pair of corresponding characteristic components; for $\vec{T}_1$ and $\vec{T}_2$, for example: for the X characteristic component, $sim_1' = \cos(\vec{X}_1, \vec{X}_2)$; for the Y characteristic component, $sim_2' = \cos(\vec{Y}_1, \vec{Y}_2)$. From these cosine values it can then obtain the similarity between $\vec{T}_1$ and $\vec{T}_2$ as $similarity(\vec{T}_1, \vec{T}_2) = a \cdot sim_1' + b \cdot sim_2'$. If a=0.5 and b=0.5, and both cosine values equal 1, sequence cluster acquisition device 13 calculates the similarity between $\vec{T}_1$ and $\vec{T}_2$ as $similarity(\vec{T}_1, \vec{T}_2) = a \cdot sim_1' + b \cdot sim_2' = 0.5 \times 1 + 0.5 \times 1 = 1$, which is greater than the predetermined threshold, such as 0.8. It therefore determines that the sequences "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers", which correspond to feature vectors $\vec{T}_1$ and $\vec{T}_2$ respectively, belong to the same near-synonymy sequence cluster, such as synonyms-cluster1. Next, when determining whether the sequence of feature vector $\vec{T}_3$ belongs to the near-synonymy sequence cluster synonyms-cluster1, sequence cluster acquisition device 13 similarly computes the cosine values of the included angles to determine the similarity for $\vec{T}_3$; if the similarity obtained is less than the predetermined threshold, such as 0.8, it assigns the sequence "English training" corresponding to feature vector $\vec{T}_3$ to another near-synonymy sequence cluster, such as synonyms-cluster2.
Those skilled in the art will understand that the above manner of determining the similarity information between the sequences in the initial near-synonymy sequence cluster is merely an example; other existing or future manners of determining the similarity information between the sequences in the initial near-synonymy sequence cluster, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
Those skilled in the art will understand that the above manner of obtaining the near-synonymy sequence clusters is merely an example; other existing or future manners of obtaining the near-synonymy sequence clusters, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
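The threshold-based grouping used in the examples above can be sketched as a single greedy pass. This is an assumption-laden illustration: the source does not specify the clustering procedure beyond comparing a similarity with a predetermined threshold, so the greedy comparison against the first member of each cluster, the function name `cluster_by_threshold`, the identifiers q1 to q3 and the value 0.5 for the dissimilar pairs are all hypothetical.

```python
def cluster_by_threshold(items, sim, threshold=0.8):
    """Greedy pass: join the first cluster whose first member is similar
    enough, otherwise start a new cluster."""
    clusters = []
    for item in items:
        for cluster in clusters:
            if sim(cluster[0], item) > threshold:
                cluster.append(item)
                break
        else:
            # No existing cluster matched, so the item opens a new one.
            clusters.append([item])
    return clusters

# Hypothetical pairwise similarities: q1 and q2 are near-synonymous (1.0),
# q3 is not (0.5 is an assumed value below the threshold).
table = {
    frozenset(["q1", "q2"]): 1.0,
    frozenset(["q1", "q3"]): 0.5,
    frozenset(["q2", "q3"]): 0.5,
}
clusters = cluster_by_threshold(["q1", "q2", "q3"],
                                lambda a, b: table[frozenset([a, b])])
```

With these assumed values q1 and q2 end up together and q3 forms its own cluster, mirroring synonyms-cluster1 and synonyms-cluster2 in the text.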
The devices of determining equipment 1 work continuously with one another. Specifically, acquisition device 11 continues to obtain a plurality of near-synonymy sequence pairs; initial determining device 12 continues to determine the initial near-synonymy sequence clusters corresponding to the plurality of near-synonymy sequence pairs; sequence cluster acquisition device 13 continues to cluster the sequences in the initial near-synonymy sequence cluster according to the feature vectors of those sequences, to obtain one or more near-synonymy sequence clusters. Here, those skilled in the art should understand that "continues" means that the devices of determining equipment 1 constantly perform, respectively, the obtaining of the plurality of near-synonymy sequence pairs, the determination of the initial near-synonymy sequence clusters and the obtaining of the near-synonymy sequence clusters, until determining equipment 1 stops obtaining the plurality of related query sequences and the corresponding plurality of search results for a long period of time.
Preferably, initial determining device 12 comprises an intensive determining unit (not shown) and a sequence merging unit (not shown). Specifically, the intensive determining unit determines, based on a label propagation algorithm and according to the labels corresponding to the sequences of the plurality of near-synonymy sequence pairs, the intensive sequence clusters corresponding to the plurality of near-synonymy sequence pairs; the sequence merging unit performs sequence merging on the sequences of the plurality of near-synonymy sequence pairs according to the intensive sequence clusters, to obtain the initial near-synonymy sequence clusters.
Specifically, the intensive determining unit first takes the sequences of the plurality of near-synonymy sequence pairs obtained by acquisition device 11 as vertices and assigns a unique label to each vertex, that is, each sequence of the plurality of near-synonymy sequence pairs corresponds to one unique label; it takes the near-synonymy relations between the sequences of the plurality of near-synonymy sequence pairs as edges, and so constructs a sequence connection graph. Then, based on the label propagation algorithm, the intensive determining unit determines the label of each vertex, for example as the label that occurs most frequently among its adjacent vertices, and iterates in turn; finally, vertices with the same label are gathered into a cluster, which yields the intensive sequence clusters corresponding to the plurality of near-synonymy sequence pairs. For example, suppose the plurality of near-synonymy sequence pairs obtained by acquisition device 11 comprises query1-query2, query1-query3, query1-query4, query2-query4, query5-query6, query6-query7 and query5-query8; the unique labels that the intensive determining unit assigns to the sequences of these near-synonymy sequence pairs are shown in Table 1 below:
Sequence    Corresponding label    Sequence    Corresponding label
query1      A1                     query5      E1
query2      B1                     query6      F1
query3      C1                     query7      G1
query4      D1                     query8      H1
Table 1
The sequence connection graph that the intensive determining unit obtains for these near-synonymy sequence pairs is shown in Fig. 2, where a solid line indicates that a pair of sequences has a near-synonymy relation and a dotted line indicates that it does not. Then, based on the label propagation algorithm, the intensive determining unit determines the label of each vertex shown in Fig. 2, for example as the label that occurs most frequently among the vertex's adjacent vertices. Taking vertex A1 as an example, the labels of its adjacent nodes are B1, C1 and D1. Suppose each of B1, C1 and D1 has frequency 1; the intensive determining unit then sets the label of vertex A1 to the adjacent label with the largest label number, such as D1, or to the adjacent label with the smallest label number, such as B1. Similarly, the intensive determining unit determines in turn that the labels of vertices B1, C1, D1, E1, F1, G1 and H1 are A1, A1, A1, F1, G1, F1 and E1 respectively, which yields the new sequence connection graph corresponding to Fig. 2, as shown in Fig. 3. Then, the intensive determining unit gathers the vertices with the same label into a cluster, to obtain the intensive sequence clusters corresponding to the plurality of near-synonymy sequence pairs. For example, the new label of initial labels B1, C1 and D1 is A1, and the new label of initial labels E1 and G1 is F1; the intensive determining unit therefore gathers the vertices of initial labels B1, C1 and D1 into one cluster, obtaining an intensive sequence cluster such as intensive-cluster1, which comprises the sequences query2, query3 and query4 corresponding to initial labels B1, C1 and D1. In the same way, it gathers the vertices of initial labels E1 and G1 into one cluster, obtaining an intensive sequence cluster such as intensive-cluster2, which comprises the sequences query5 and query7 corresponding to initial labels E1 and G1.
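One round of the label propagation described above can be sketched as follows, using the pairs and labels of Table 1. This is an illustration under stated assumptions: the update is synchronous (every vertex reads the labels from before the round), and ties are broken by the smallest label, which is only one of the options the text mentions. The function names `propagate_once` and `group_by_label` are hypothetical.

```python
from collections import Counter, defaultdict

def propagate_once(edges, labels):
    """One synchronous round: each vertex takes the most frequent label
    among its neighbors; ties go to the smallest label (an assumption)."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    new_labels = {}
    for vertex, old in labels.items():
        counts = Counter(labels[n] for n in neighbors[vertex])
        if not counts:
            new_labels[vertex] = old  # isolated vertex keeps its label
            continue
        best = max(counts.values())
        new_labels[vertex] = min(l for l, c in counts.items() if c == best)
    return new_labels

def group_by_label(labels):
    """Gather vertices that carry the same label."""
    groups = defaultdict(set)
    for vertex, label in labels.items():
        groups[label].add(vertex)
    return dict(groups)

edges = [("query1", "query2"), ("query1", "query3"), ("query1", "query4"),
         ("query2", "query4"), ("query5", "query6"), ("query6", "query7"),
         ("query5", "query8")]
labels = {"query%d" % i: name + "1"
          for i, name in enumerate("ABCDEFGH", start=1)}

groups = group_by_label(propagate_once(edges, labels))
```

Under this assumed tie-break, the round groups query2, query3 and query4 under label A1 and query5 and query7 under label F1, matching intensive-cluster1 and intensive-cluster2. Other vertices may end up labeled differently from the text, which does not fix a single tie-break rule.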
Those skilled in the art will understand that the above manner of determining the intensive sequence clusters is merely an example; other existing or future manners of determining the intensive sequence clusters, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
Then, the sequence merging unit performs sequence merging on the sequences of the plurality of near-synonymy sequence pairs according to the intensive sequence clusters, to obtain the initial near-synonymy sequence clusters. Specifically, the sequence merging unit first treats each intensive sequence cluster as one vertex and determines, for the node corresponding to each sequence of the plurality of near-synonymy sequence pairs, its node to be merged: among all nodes connected to that node, the node with the highest degree is taken as the node to be merged; when several of the connected nodes share the same highest degree, one of them is selected at random as the node to be merged. Here, the degree of a node is the number of nodes connected to it. Then, based on predetermined merging rules, the sequence merging unit performs sequence merging on the sequences of the plurality of near-synonymy sequence pairs, to obtain the initial near-synonymy sequence clusters. Here, the predetermined merging rules include, but are not limited to, at least any of the following: 1) nodes that have the same node to be merged are merged; 2) nodes that are each other's node to be merged are merged.
For example, continuing the example above, the sequence merging unit treats each intensive sequence cluster determined by the intensive determining unit as one vertex and obtains the node merging diagram corresponding to Fig. 3, as shown in Fig. 4. In it, the initial labels B1, C1 and D1 comprised by intensive sequence cluster intensive-cluster1 form one vertex, whose label is A1, and the initial labels E1 and G1 comprised by intensive sequence cluster intensive-cluster2 form one vertex, whose label is F1. In Fig. 4, nodes {E1, G1} have the same node to be merged, and the nodes of the pair {A1, B1} are each other's node to be merged. The sequence merging unit therefore merges nodes {E1, G1} and merges node pair {A1, B1}, treats the merged sets {E1, G1, F1} and {A1, B1} as nodes, rebuilds the sequence connection graph, and continues node merging until no edge remains between any two nodes. Finally, the sequence merging unit assigns the sequences corresponding to the set {E1, G1, F1} to one near-synonymy sequence cluster and the sequences corresponding to the set {A1, B1} to another near-synonymy sequence cluster.
As another example, suppose the node merging diagram that the sequence merging unit obtains after treating each intensive sequence cluster determined by the intensive determining unit as one vertex is as shown in Fig. 5. Nodes {A, B, D} have the same node to be merged, and the nodes of the pair {E, F} are each other's node to be merged. The sequence merging unit therefore merges nodes {A, B, D} and merges node pair {E, F}, treats the merged sets {A, B, D, C} and {E, F} as nodes, rebuilds the sequence connection graph, and continues node merging until no edge remains between any two nodes. Finally, the sequence merging unit assigns the sequences corresponding to the set {A, B, D, C} to one near-synonymy sequence cluster and the sequences corresponding to the set {E, F} to another near-synonymy sequence cluster, as shown in Fig. 6.
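The merging loop described above can be sketched as follows. Several points are assumptions: the edge list is hypothetical (Fig. 5 is not reproduced in the text, so a graph consistent with its description is used), each intensive sequence cluster is assumed to have been collapsed to a single vertex beforehand, a node is merged together with its node to be merged, and ties between equally high degrees are broken deterministically by name instead of at random. The function name `merge_nodes` is hypothetical.

```python
def merge_nodes(edges):
    """Repeatedly merge each group with its highest-degree neighbor until
    no edge connects two different groups."""
    group_of = {n: frozenset([n]) for edge in edges for n in edge}
    while True:
        # Rebuild the connection graph over the current groups.
        adj = {}
        for u, v in edges:
            gu, gv = group_of[u], group_of[v]
            if gu != gv:
                adj.setdefault(gu, set()).add(gv)
                adj.setdefault(gv, set()).add(gu)
        if not adj:
            break
        # Node to be merged: the neighbor with the highest degree; ties are
        # broken by member names (assumption; the text picks at random).
        target = {g: max(nbrs, key=lambda h: (len(adj[h]), sorted(h)))
                  for g, nbrs in adj.items()}
        # Union every group with its target (covers both merging rules).
        parent = {g: g for g in adj}

        def find(g):
            while parent[g] != g:
                g = parent[g]
            return g

        for g, t in target.items():
            parent[find(g)] = find(t)
        merged = {}
        for g in adj:
            merged.setdefault(find(g), set()).update(g)
        for members in merged.values():
            new_group = frozenset(members)
            for n in members:
                group_of[n] = new_group
    return set(group_of.values())

# Hypothetical graph: A, B and D all point at C; E and F point at each other.
result = merge_nodes([("A", "C"), ("B", "C"), ("D", "C"), ("E", "F")])
```

For this hypothetical graph the loop ends with the sets {A, B, C, D} and {E, F}, the same final grouping the text gives for Fig. 6.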
Those skilled in the art will understand that the above manner of performing sequence merging on the sequences of the plurality of near-synonymy sequence pairs is merely an example; other existing or future manners of performing sequence merging on the sequences of the plurality of near-synonymy sequence pairs, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
In a preferred embodiment (with reference to Fig. 1), determining equipment 1 comprises acquisition device 11, initial determining device 12 and sequence cluster acquisition device 13, wherein sequence cluster acquisition device 13 comprises a candidate acquisition unit (not shown) and a denoising unit (not shown). The preferred embodiment is described below with reference to Fig. 1. Specifically, acquisition device 11 obtains a plurality of near-synonymy sequence pairs; initial determining device 12 determines the initial near-synonymy sequence clusters corresponding to the plurality of near-synonymy sequence pairs; the candidate acquisition unit clusters the sequences in the initial near-synonymy sequence cluster according to the feature vectors of those sequences, to obtain one or more candidate near-synonymy sequence clusters; the denoising unit denoises the candidate near-synonymy sequence clusters, to obtain the near-synonymy sequence clusters. Here, acquisition device 11 and initial determining device 12 are the same as or similar to the corresponding devices in the embodiment shown in Fig. 1; for brevity they are not described again here and are incorporated by reference.
Specifically, the candidate acquisition unit clusters the sequences in the initial near-synonymy sequence cluster according to the feature vectors of those sequences, to obtain one or more candidate near-synonymy sequence clusters. Here, the way the candidate acquisition unit obtains the one or more candidate near-synonymy sequence clusters is the same as or similar to the way sequence cluster acquisition device 13 in Fig. 1 obtains the one or more near-synonymy sequence clusters; for brevity it is not described again here and is incorporated by reference.
Then, the denoising unit denoises the candidate near-synonymy sequence cluster, for example by removing redundant text, to obtain the near-synonymy sequence cluster. For example, suppose the candidate near-synonymy sequence cluster candidate-cluster obtained by the candidate acquisition unit comprises several sequences, such as queryA: "asking the way of egg; the egg menu; how homely egg is done; the menu complete works", queryB: "homely egg way" and queryC: "how doing homely egg dish". The denoising unit performs semantic analysis on the description texts of the sequences queryA, queryB and queryC comprised in this candidate near-synonymy sequence cluster and finds that the theme of the sequences in the cluster is the way of cooking egg, whereas the text "menu complete works" contained in the description text of queryA deviates from this theme. The denoising unit therefore judges the text "menu complete works" to be redundant text and removes it from the description text of queryA, obtaining the near-synonymy sequence cluster corresponding to this candidate near-synonymy sequence cluster, which comprises the sequences queryA: "asking the way of egg; the egg menu; how homely egg is done", queryB: "homely egg way" and queryC: "how doing homely egg dish".
Preferably, the denoising unit may also denoise the candidate near-synonymy sequence cluster according to the similarity information between the feature vector of each sequence in the candidate near-synonymy sequence cluster and the cluster feature vector of that candidate near-synonymy sequence cluster, to obtain the near-synonymy sequence cluster. Specifically, the denoising unit first determines the feature vectors of the sequences in the candidate near-synonymy sequence cluster and the cluster feature vector of the candidate near-synonymy sequence cluster. It then determines the similarity information between the feature vector of each sequence and the cluster feature vector: when the feature vectors of the sequences and the cluster feature vector include characteristic components whose coefficients are textual descriptions, the similarity information is determined from the degree of text matching between the coefficients of the feature vector of the sequence and those of the cluster feature vector; when they do not include such characteristic components, the similarity information is determined from the angle between the feature vector of the sequence and the cluster feature vector. Then, according to this similarity information, the denoising unit denoises the candidate near-synonymy sequence cluster, for example by deleting from it the sequences whose similarity information is less than a predetermined threshold, to obtain the near-synonymy sequence cluster.
Specifically, the denoising unit first determines the feature vectors of the sequences in the candidate near-synonymy sequence cluster and the cluster feature vector of that candidate cluster. To do so, it first determines the feature vector of each sequence in the candidate cluster; it then determines the cluster feature vector from these sequence feature vectors, for example by taking the mean of the vector coefficients of each feature component across the sequence feature vectors as the vector coefficient of the corresponding feature component of the cluster feature vector. Here, the way the denoising unit determines the feature vectors of the sequences in the candidate cluster is the same as or similar to the way the sequence cluster acquisition device 13 of Fig. 1 determines the feature vectors of the sequences in the initial near-synonymy sequence cluster; for brevity it is not repeated here and is incorporated herein by reference.
Those skilled in the art will understand that the above manner of determining the cluster feature vector is only an example; other existing or future manners of determining the cluster feature vector, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
Then, the denoising unit determines the similarity information between the feature vector of each sequence in the candidate near-synonymy sequence cluster and the cluster feature vector of that candidate cluster. Here, the way the denoising unit determines the similarity information is the same as or similar to the way the sequence cluster acquisition device 13 of Fig. 1 determines the similarity between the feature vectors of sequences in the initial near-synonymy sequence cluster; for brevity it is not repeated here and is incorporated herein by reference.
Then, the denoising unit denoises the candidate near-synonymy sequence cluster according to this similarity information, for example by deleting from the candidate cluster any sequence whose similarity information is less than a predetermined threshold, to obtain the near-synonymy sequence cluster. For example, if the denoising unit determines that the similarity information between the feature vector of sequence queryA in the candidate near-synonymy sequence cluster candidate-cluster and the cluster feature vector of candidate-cluster is 0.86, which is less than the predetermined threshold 0.9, the denoising unit deletes sequence queryA from candidate-cluster to obtain the near-synonymy sequence cluster.
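The similarity-based denoising described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the feature vectors are taken to be purely numeric, the cluster feature vector is the component-wise mean, the similarity information is the cosine of the angle between the two vectors, and the function names (`cluster_feature_vector`, `cosine_similarity`, `denoise_cluster`) and example values are hypothetical.

```python
import math


def cluster_feature_vector(vectors):
    """Component-wise mean of the sequence feature vectors (assumed numeric)."""
    count = len(vectors)
    dimension = len(vectors[0])
    return [sum(v[i] for v in vectors) / count for i in range(dimension)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def denoise_cluster(features, threshold=0.9):
    """Keep only sequences whose similarity to the cluster vector reaches the threshold.

    `features` maps a sequence name to its feature vector (hypothetical layout).
    """
    center = cluster_feature_vector(list(features.values()))
    return {
        name: vector
        for name, vector in features.items()
        if cosine_similarity(vector, center) >= threshold
    }


# Hypothetical feature vectors: queryA points away from the other two sequences.
candidate_cluster = {
    "queryA": [0.3, 1.0],
    "queryB": [1.0, 0.2],
    "queryC": [1.0, 0.3],
}
kept = denoise_cluster(candidate_cluster, threshold=0.9)
```

With these illustrative vectors, queryA falls below the threshold and is dropped, mirroring the queryA example in the text.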
Those skilled in the art will understand that the above manner of denoising the candidate near-synonymy sequence cluster is only an example; other existing or future manners of denoising the candidate near-synonymy sequence cluster, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
In another preferred embodiment, the above determining device 1 for determining near-synonymy sequence clusters may be combined with an existing search engine to form a new search engine. Existing search engines include, but are not limited to, the Google search engine of Google and the Baidu search engine of Baidu.
In another preferred embodiment, the above determining device 1 for determining near-synonymy sequence clusters may be combined with an existing search engine plug-in to form a new search engine plug-in. Existing search engine plug-ins include, but are not limited to, the Google ToolBar of Google, the Baidu Sobar toolbar of Baidu, and the MSN ToolBar of Microsoft.
Fig. 7 shows a schematic diagram of a device for determining near-synonymy sequence clusters according to a preferred embodiment of the present invention, in which the determining device 1 comprises an acquisition device 11', an initial determination device 12', a sequence cluster acquisition device 13' and a sequence library establishing device 14'. Specifically, the acquisition device 11' obtains a plurality of near-synonymy sequence pairs; the initial determination device 12' determines the initial near-synonymy sequence cluster corresponding to the plurality of near-synonymy sequence pairs; the sequence cluster acquisition device 13' clusters the sequences in the initial near-synonymy sequence cluster according to their feature vectors to obtain one or more near-synonymy sequence clusters; and the sequence library establishing device 14' establishes or updates a near-synonymy sequence library according to the near-synonymy sequence clusters. Here, the acquisition device 11', the initial determination device 12' and the sequence cluster acquisition device 13' are the same as or similar to the corresponding devices in the embodiment shown in Fig. 1; for brevity they are not described again here and are incorporated herein by reference.
Specifically, the sequence library establishing device 14' establishes or updates the near-synonymy sequence library according to the near-synonymy sequence clusters. For example, suppose a near-synonymy sequence cluster obtained by the sequence cluster acquisition device 13', such as synonyms-cluster1, comprises the sequences "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers". The sequence library establishing device 14' stores the sequences of synonyms-cluster1 in the near-synonymy sequence library and updates the library in some manner, for example periodically according to a predetermined period, on a schedule, or immediately.
Those skilled in the art will understand that the above manner of updating the near-synonymy sequence library is only an example; other existing or future manners of updating the near-synonymy sequence library, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
Preferably, the determining device 1 further comprises a detection device (not shown) and a redundancy-removal device (not shown). Specifically, the detection device detects whether a sequence in the near-synonymy sequence cluster is also present in other near-synonymy sequence clusters in the near-synonymy sequence library; if so, the redundancy-removal device performs redundancy removal on that sequence to update the near-synonymy sequence library.
Specifically, the detection device detects whether a sequence in the near-synonymy sequence cluster is also present in other near-synonymy sequence clusters in the near-synonymy sequence library. For example, suppose a near-synonymy sequence cluster obtained by the sequence cluster acquisition device 13', such as synonyms-cluster1, comprises the sequences "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers". The detection device compares the sequences in synonyms-cluster1 with the sequences contained in the other near-synonymy sequence clusters of the library established by the sequence library establishing device 14', and determines that the sequences of synonyms-cluster1 are not present in any other near-synonymy sequence cluster in the library.
If such a sequence exists, the redundancy-removal device performs redundancy removal on it to update the near-synonymy sequence library. For example, continuing the example above, suppose the detection device finds that another near-synonymy sequence cluster in the library, such as synonyms-cluster1', also contains the sequence "Expert English language training by qualified teachers" that is present in synonyms-cluster1. The redundancy-removal device then performs redundancy removal on this sequence, for example according to the relevance between the repeated sequence and each near-synonymy sequence cluster that contains it: the sequence is kept in the cluster with the highest relevance and its occurrences in the other clusters are deleted, so that it is present in only one sequence cluster, thereby updating the near-synonymy sequence library. The present invention thus ensures that the updated near-synonymy sequence library contains no sequence belonging to more than one sequence cluster, which improves the accuracy of the near-synonymy sequence clusters.
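The redundancy removal just described can be sketched as below. This is an illustrative sketch only: the text does not specify how relevance between a sequence and a cluster is computed, so the `relevance` callable, the dictionary layout of the library, and the example scores are all assumptions.

```python
def remove_redundancy(library, relevance):
    """Keep each sequence only in the cluster where its relevance is highest.

    `library` maps a cluster id to a set of sequences (hypothetical layout).
    `relevance(sequence, cluster_id)` returns a float; how it is computed is
    an assumption left to the caller. Ties keep the first cluster encountered.
    """
    best = {}  # sequence -> (cluster id, relevance score)
    for cluster_id, sequences in library.items():
        for sequence in sequences:
            score = relevance(sequence, cluster_id)
            if sequence not in best or score > best[sequence][1]:
                best[sequence] = (cluster_id, score)
    return {
        cluster_id: {s for s in sequences if best[s][0] == cluster_id}
        for cluster_id, sequences in library.items()
    }


# Hypothetical library in which "english training" appears in two clusters.
library = {
    "synonyms-cluster1": {"english training", "english course"},
    "synonyms-cluster1'": {"english training", "language school"},
}
scores = {
    ("english training", "synonyms-cluster1"): 0.9,
    ("english training", "synonyms-cluster1'"): 0.4,
}
updated = remove_redundancy(library, lambda s, c: scores.get((s, c), 1.0))
```

After the call, the repeated sequence remains only in the cluster with the higher illustrative score, so no sequence belongs to more than one cluster.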
Those skilled in the art will understand that the above manner of performing redundancy removal on a sequence to update the near-synonymy sequence library is only an example; other existing or future manners of performing redundancy removal on a sequence to update the near-synonymy sequence library, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
Those skilled in the art will understand that, in specific embodiments, the sequence library establishing device 14', the detection device and the redundancy-removal device may be mutually independent modules or may be integrated together.
Preferably, the determining device 1 further comprises a first sequence acquisition device (not shown), a first query device (not shown) and a first providing device (not shown). Specifically, the first sequence acquisition device obtains the query sequence input by a user; the first query device performs a matching query in the near-synonymy sequence library according to the query sequence, to determine the target near-synonymy sequence cluster corresponding to the query sequence; and the first providing device provides at least one sequence in the target near-synonymy sequence cluster to the user as a recommended item for the query sequence.
Specifically, the first sequence acquisition device obtains the query sequence input by the user through dynamic web page technologies such as ASP or JSP, or through an application programming interface (API) provided by the search engine. For example, if search user A enters the keyword "Expert English language training by qualified teachers" in the search box of a search engine on an iPhone mobile device and presses the Enter key, the first sequence acquisition device obtains, through dynamic web page technologies such as ASP or JSP, the query sequence "Expert English language training by qualified teachers" that user A entered on that device.
The first query device performs a matching query in the near-synonymy sequence library according to the query sequence, to determine the target near-synonymy sequence cluster corresponding to the query sequence. For example, continuing the example above, the first query device uses the query sequence obtained by the first sequence acquisition device to perform a matching query in the near-synonymy sequence library established or updated by the sequence library establishing device 14', and obtains the corresponding target near-synonymy sequence cluster, for example by taking the near-synonymy sequence cluster that contains the query sequence "Expert English language training by qualified teachers" as the target cluster. That cluster comprises near-synonymy sequences such as "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers".
The first providing device provides at least one sequence in the target near-synonymy sequence cluster to the user as a recommended item for the query sequence, through dynamic web page technologies such as ASP, JSP or PHP, or through other agreed communication means such as the HTTP or HTTPS protocols. For example, continuing the example above, the first providing device offers the near-synonymy sequences "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers", contained in the cluster in which the query sequence is located, to user A as recommended items to browse and select. For instance, when user A types the sequence "Expert English language training by qualified teachers" in the search box, the first providing device presents at least one sequence of the target near-synonymy sequence cluster determined by the first query device to user A in a drop-down box, as recommended items for that sequence.
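The query-and-recommend flow of the three devices above can be sketched as follows. This is a minimal sketch under assumptions: the matching query is reduced to an exact-match lookup through an inverted index, each sequence is assumed to belong to exactly one cluster, and the names `build_index` and `recommend` as well as the example data are hypothetical.

```python
def build_index(library):
    """Map every sequence to the id of the cluster that contains it.

    Assumes each sequence appears in exactly one cluster of `library`,
    which maps a cluster id to an ordered list of sequences.
    """
    index = {}
    for cluster_id, sequences in library.items():
        for sequence in sequences:
            index[sequence] = cluster_id
    return index


def recommend(query, library, index, limit=5):
    """Return up to `limit` other sequences from the query's target cluster."""
    cluster_id = index.get(query)
    if cluster_id is None:
        return []  # no target cluster matches this query
    return [s for s in library[cluster_id] if s != query][:limit]


# Hypothetical near-synonymy sequence library.
library = {
    "synonyms-cluster1": ["english training", "english course", "english class"],
    "synonyms-cluster2": ["flowers", "flower delivery"],
}
index = build_index(library)
suggestions = recommend("english training", library, index)
```

The returned list is what a front end could show in a drop-down box; a production system would likely use fuzzy or normalized matching rather than the exact lookup assumed here.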
In a preferred embodiment (with reference to Fig. 7), the determining device 1 further comprises a second sequence acquisition device (not shown), a second query device (not shown) and a second providing device (not shown). This preferred embodiment is described below with reference to Fig. 7. Specifically, the acquisition device 11' obtains a plurality of near-synonymy sequence pairs; the initial determination device 12' determines the initial near-synonymy sequence cluster corresponding to the plurality of near-synonymy sequence pairs; the sequence cluster acquisition device 13' clusters the sequences in the initial near-synonymy sequence cluster according to their feature vectors to obtain one or more near-synonymy sequence clusters; the sequence library establishing device 14' may also establish or update the near-synonymy sequence library according to the near-synonymy sequence clusters and their corresponding groups of preferred search results, where each near-synonymy sequence cluster corresponds to one group of preferred search results; the second sequence acquisition device obtains the query sequence input by a user; the second query device performs a matching query in the near-synonymy sequence library according to the query sequence, to obtain the target near-synonymy sequence cluster corresponding to the query sequence; and the second providing device provides to the user at least one of the group of preferred search results corresponding to the target near-synonymy sequence cluster. Here, the acquisition device 11', the initial determination device 12' and the sequence cluster acquisition device 13' are the same as or similar to the corresponding devices in the embodiment shown in Fig. 1; for brevity they are not described again here and are incorporated herein by reference.
Specifically, the sequence library establishing device 14' first takes the near-synonymy sequence cluster obtained by the sequence cluster acquisition device 13', counts the search results recorded in the search log that users clicked and that match the sequences in this cluster, and takes the search results whose number of occurrences exceeds a certain threshold as the group of preferred search results. The sequence library establishing device 14' then establishes or updates the near-synonymy sequence library according to the near-synonymy sequence cluster and its corresponding group of preferred search results, where each near-synonymy sequence cluster corresponds to one group of preferred search results. Here, the group of preferred search results comprises high-quality, highly authoritative search results that match the near-synonymy sequence cluster and genuinely meet the user's search need. They can be derived by statistical analysis of users' search and browsing behavior, for example by taking the search results for pages on which users spend a long browsing time, or the search results that users click and visit most often, as the preferred search results.
For example, suppose the near-synonymy sequence cluster synonyms-cluster1 obtained by the sequence cluster acquisition device 13' comprises the sequences "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers". The sequence library establishing device 14' then counts, for these sequences, the search results recorded in the search log that users clicked and that match the sequences in the cluster, and takes as preferred search results those whose number of occurrences exceeds a certain threshold, such as 2. In this way the sequence library establishing device 14' can determine from the search log that the preferred search results corresponding to synonyms-cluster1 include, for example, "EF Englishtown official website, global distinguished Expert English language training by qualified teachers expert". The sequence library establishing device 14' then stores synonyms-cluster1 and its corresponding group of preferred search results in the near-synonymy sequence library and updates the library in some manner, for example periodically according to a predetermined period, on a schedule, or immediately.
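The click-count statistic described above can be sketched as follows. This is a hedged illustration: the search log is assumed to be a list of (query sequence, clicked result title) tuples, which the text does not specify, and the function name `preferred_results` and the sample log entries are hypothetical. The strict "greater than" comparison follows the text's "exceeds a certain threshold, such as 2".

```python
from collections import Counter


def preferred_results(cluster, click_log, threshold=2):
    """Return clicked results whose occurrence count exceeds `threshold`.

    `cluster` is an iterable of sequences; `click_log` is a list of
    (query, clicked_result) tuples (an assumed log format). Only clicks
    whose query belongs to the cluster are counted.
    """
    members = set(cluster)
    counts = Counter(result for query, result in click_log if query in members)
    # most_common() orders results from most to least frequently clicked.
    return [result for result, count in counts.most_common() if count > threshold]


# Hypothetical cluster and click log.
cluster = ["english training", "english course"]
click_log = [
    ("english training", "EF Englishtown official website"),
    ("english course", "EF Englishtown official website"),
    ("english training", "EF Englishtown official website"),
    ("english course", "Some other English school"),
    ("flowers", "EF Englishtown official website"),  # query outside the cluster
]
preferred = preferred_results(cluster, click_log, threshold=2)
```

Counting across all member sequences, rather than per sequence, is one reading of the text; a per-sequence threshold would be an equally plausible variant.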
Those skilled in the art will understand that the above manner of determining the group of preferred search results corresponding to a near-synonymy sequence cluster is only an example; other existing or future manners of determining the group of preferred search results corresponding to a near-synonymy sequence cluster, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
The second sequence acquisition device obtains the query sequence input by the user. Here, the way the second sequence acquisition device obtains the query sequence input by the user is the same as or similar to that of the first sequence acquisition device; for brevity it is not repeated here and is incorporated herein by reference.
The second query device performs a matching query in the near-synonymy sequence library according to the query sequence, to obtain the target near-synonymy sequence cluster corresponding to the query sequence. Here, the way the second query device obtains the target near-synonymy sequence cluster is the same as or similar to that of the first query device; for brevity it is not repeated here and is incorporated herein by reference.
The second providing device provides at least one of the group of preferred search results corresponding to the target near-synonymy sequence cluster, such as "EF Englishtown official website; global distinguished Expert English language training by qualified teachers expert", to the user, for example to the user's device, for the user to browse. It does so through dynamic web page technologies such as ASP, JSP or PHP, or through other agreed communication means such as the HTTP or HTTPS protocols.
Those skilled in the art will understand that, in specific embodiments, the first sequence acquisition device and the second sequence acquisition device may be mutually independent modules or may be integrated together; the same applies to the first query device and the second query device, and to the first providing device and the second providing device.
Fig. 8 shows a flow chart of a method for determining near-synonymy sequence clusters according to another aspect of the present invention.
Specifically, in step S1, the determining device 1 obtains a plurality of near-synonymy sequence pairs; in step S2, the determining device 1 determines the initial near-synonymy sequence cluster corresponding to the plurality of near-synonymy sequence pairs; in step S3, the determining device 1 clusters the sequences in the initial near-synonymy sequence cluster according to their feature vectors, to obtain one or more near-synonymy sequence clusters. Here, the determining device 1 includes, but is not limited to, a network device, a user device, or a device formed by integrating a network device and a user device over a network. The network device includes, but is not limited to, implementations such as a network host, a single network server, a set of multiple network servers, or a cloud-computing-based set of computers; alternatively, it may be implemented by a user device. Here, the cloud is formed by a large number of hosts or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers. The user device may be any electronic product capable of human-computer interaction with a user through a keyboard, mouse, touch pad, touch screen, handwriting device or the like, such as a computer, mobile phone, PDA, palmtop computer (PPC) or tablet computer. The network includes, but is not limited to, the Internet, wide area networks, metropolitan area networks, local area networks, VPN networks, wireless ad hoc networks and the like. Those skilled in the art will understand that the above determining device 1 is only an example; other existing or future network devices or user devices, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference. Here, both the network device and the user device include an electronic device that can automatically perform numerical computation and information processing according to instructions set or stored in advance, whose hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device and the like.
Specifically, in step S1, the determining device 1 obtains a plurality of search logs and then performs semantic analysis on them to obtain a plurality of near-synonymy sequence pairs. Here, a near-synonymy sequence pair includes, but is not limited to, at least any of the following: 1) a synonymous query sequence pair, whose names differ but whose meanings are equivalent, such as "Expert English language training by qualified teachers" and "English training"; 2) a near-synonymous query sequence pair, whose meanings are similar, such as "Expert English language training by qualified teachers" and "foreign language training". Those skilled in the art will understand that the above near-synonymy sequence pairs are only examples; other existing or future near-synonymy sequence pairs, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference. Here, the ways in which the determining device 1 obtains the plurality of near-synonymy sequence pairs in step S1 include, but are not limited to, at least any of the following:
1) In step S1, the determining device 1 first obtains a plurality of search logs through an application programming interface (API) for obtaining search logs provided by a third-party device such as a search engine or browser; it then performs semantic analysis on these search logs to obtain a plurality of near-synonymy sequence pairs. For example, in step S1, the determining device 1 obtains a plurality of search logs through the search-log API provided by a search engine, recording for instance which keywords users' searches contained during a certain period and which returned search results the users clicked. Then, in step S1, the determining device 1 performs semantic analysis on the query sequences in these search logs to obtain a plurality of near-synonymy sequence pairs, for example pairs formed from keywords that are synonyms or near-synonyms of the keyword "Expert English language training by qualified teachers", such as "Expert English language training by qualified teachers", "English training", "Expert English language training by qualified teachers" and "education on foreign language".
2) In step S1, the determining device 1 first obtains a plurality of search logs through an application programming interface (API) for obtaining search logs provided by a third-party device such as a search engine or browser; it then obtains one or more search records from these search logs, where each search record comprises a query sequence and its corresponding search results; finally, it obtains a plurality of near-synonymy sequence pairs according to the one or more search records. Here, the ways in which the determining device 1 obtains the near-synonymy sequence pairs from the search records in step S1 include, but are not limited to: i) classifying the one or more search records according to their corresponding search results, by performing semantic analysis on those search results (for example on the summary text, title link text or page body content of each search result), to obtain the plurality of near-synonymy sequence pairs, where the near-synonymy sequence pairs comprise the query sequences of search records that belong to the same class. For example, suppose that in step S1 the determining device 1 obtains a plurality of query sequences recorded in the search log, together with the search results corresponding to each query sequence, as the following search records I to VII:
I " Expert English language training by qualified teachers ":
" EF Englishtown official website, global distinguished Expert English language training by qualified teachers expert "
" Expert English language training by qualified teachers-Wei Bo English allows study English and becomes so simple! "
II " Expert English language training by qualified teachers ":
" the hot luxurious most solemn of ceremonies on Christmas is namely enjoyed in the Expert English language training by qualified teachers registration "
" EF Englishtown official website, global distinguished Expert English language training by qualified teachers expert "
" New Orient IELTS training "
III " English training ":
" Beijing Expert English language training by qualified teachers Wei Bo English-we are absorbed in adult's Expert English language training by qualified teachers! (official website) "
" the Expert English language training by qualified teachers Beijing IELTS training of New Orient, Beijing is entrusted training Beijing to prepare for the postgraduate qualifying examination to train and is gone abroad ... "
IV " fresh flower ":
" 3 hours at first Chinese fresh flower nets of fresh flower of fresh flower! "
" warm fresh flower net fresh flower "
V " fresh flower express delivery ":
" fresh flower, I only choose state's fresh flower express delivery net! 100% quality guarantee "
" send and take fresh flower express delivery fresh flower net everyday "
VI " dangerous forest thoughts ":
" piggy diary: " dangerous forest " thoughts-taste conversation-literature and art-Sohu's circle "
" [new information] reads " dangerous forest " thoughts-lovely piggy-Sohu's blog "
" " dangerous forest "-reaction to an article-NetCash chess/card game is downloaded "
VII " dangerous forest thoughts ":
" piggy diary: " dangerous forest " thoughts-taste conversation-literature and art-Sohu's circle "
" yellow quiet firm five (5) _ Baidu libraries of dangerous forest reaction to an article "
" [new information] reads " dangerous forest " thoughts-lovely piggy-Sohu's blog "
" " dangerous forest "-reaction to an article-NetCash chess/card game is downloaded ",
Then, in step S1, the determining device 1 performs semantic analysis on the search results corresponding to search records I to VII, for example on the title link texts of the search results, and classifies the search records accordingly, obtaining the following classes: (1) search records I to III are related and are grouped into one class; (2) search records IV and V are related and are grouped into another class; (3) search records VI and VII are related and are grouped into a third class. Then, in step S1, the determining device 1 takes the query sequences of search records belonging to the same class as near-synonymy sequence pairs. For example, it obtains the near-synonymy sequence pairs corresponding to search records I to III, such as pairs1, "Expert English language training by qualified teachers" and "English training", and pairs2, "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers"; the near-synonymy sequence pair pairs3 corresponding to search records IV and V, "fresh flower" and "fresh flower express delivery"; and the near-synonymy sequence pair pairs4 corresponding to search records VI and VII, "dangerous forest thoughts" and "dangerous forest thoughts".
ii) Classify the search sequences corresponding to the one or more search records to obtain the plurality of near-synonymy sequence pairs, wherein the near-synonymy sequence pairs comprise search sequences belonging to search records of the same class. For example, continuing the example above, in step S1 determining device 1 classifies, through semantic analysis, the search sequences corresponding to the search records I to VII it has obtained, and obtains one or more synonym sequence clusters. The near-synonymy sequence pairs comprise the search sequences of search records of the same class: for search records I to III, determining device 1 obtains pairs1, "Expert English language training by qualified teachers" and "English training", and pairs2, "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers"; for search records IV and V it obtains pairs3, "fresh flower" and "fresh flower express delivery"; for search records VI and VII it obtains pairs4, "dangerous forest thoughts" and "dangerous forest thoughts".
iii) Take a search sequence in the search record and its corresponding search result as a near-synonymy sequence pair. For example, suppose that in step S1 determining device 1 learns from the search log that user A clicked, among the search results matching the search sequence "Expert English language training by qualified teachers", the search result "Beijing Expert English language training by qualified teachers Wei Bo English-we are absorbed in adult's Expert English language training by qualified teachers! (official website)". Determining device 1 then takes the title of that search result as a sequence and forms a near-synonymy sequence pair from it and the sequence "Expert English language training by qualified teachers".
iv) Take different search sequences in the search record that correspond to the same search result as a near-synonymy sequence pair. For example, suppose that in step S1 determining device 1 learns from the search log that user A clicked, among the search results matching the search sequence "Expert English language training by qualified teachers", the search result "Beijing Expert English language training by qualified teachers Wei Bo English-we are absorbed in adult's Expert English language training by qualified teachers! (official website)", and that user B, searching with the search sequence "foreign language training", clicked the same search result among the results matching "foreign language training". Determining device 1 then forms a near-synonymy sequence pair from the sequences "Expert English language training by qualified teachers" and "foreign language training".
3) Starting from a plurality of candidate near-synonymy sequence pairs that have been labelled with a near-synonymy relation, screen the candidate pairs according to the relevance information between the two sequences of each candidate pair, to obtain the near-synonymy sequence pairs. Specifically, in step S1, determining device 1 first determines the relevance information between the two sequences of each candidate pair, for example from the text matching degree of the two sequences, or from frequency information associating the two sequences in the search log, such as the number of times the two sequences correspond to the same search result. Determining device 1 then screens the candidate pairs according to this relevance information, for example by deleting candidate pairs whose relevance is below a predetermined threshold, to obtain the near-synonymy sequence pairs. For example, suppose that in step S1 determining device 1 obtains the following candidate near-synonymy sequence pairs labelled with a near-synonymy relation:
Pairs1 " Expert English language training by qualified teachers " and " English training "
Pairs2 " Expert English language training by qualified teachers " and " Expert English language training by qualified teachers "
Pairs3 " fresh flower " and " fresh flower express delivery "
Pairs4 " dangerous forest thoughts " and " dangerous forest thoughts "
Then, in step S1, determining device 1 first determines the relevance information between the two sequences of each candidate pair, for example from their text matching degree, and obtains relevance values of 0.75, 1, 0.5 and 1 for pairs1 to pairs4 respectively. Determining device 1 then screens the candidate pairs according to this relevance information, for example by deleting the candidate pairs whose relevance is below a predetermined threshold such as 0.7, which here is pairs3. The resulting near-synonymy sequence pairs are pairs1, "Expert English language training by qualified teachers" and "English training"; pairs2, "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers"; and pairs4, "dangerous forest thoughts" and "dangerous forest thoughts".
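The screening step in this example can be illustrated with a minimal Python sketch. The relevance values and the threshold 0.7 are those of the example; the function name `screen_pairs` and the dictionary layout are assumptions made only for illustration, and how the relevance values themselves are computed is left out.

```python
# Relevance values of the candidate pairs, as given in the example above.
candidates = {
    "pairs1": 0.75,
    "pairs2": 1.0,
    "pairs3": 0.5,
    "pairs4": 1.0,
}
THRESHOLD = 0.7  # predetermined threshold from the example


def screen_pairs(relevance_by_pair, threshold):
    """Keep the candidate pairs whose relevance is not below the threshold."""
    return [name for name, score in relevance_by_pair.items() if score >= threshold]


kept = screen_pairs(candidates, THRESHOLD)
print(kept)
```

Only pairs3, whose relevance 0.5 is below 0.7, is dropped.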
Preferably, in step S1, determining device 1 can first obtain a plurality of sequence-result pairs from a plurality of search logs, and then, according to association information between the search results of those sequence-result pairs, screen out a plurality of near-synonymy sequence pairs from the sequences they contain. For example, in step S1, determining device 1 obtains the sequence-result pairs recorded in the search logs, such as the search records I to VII above, through an application programming interface (API) for search logs provided by third-party equipment such as a search engine or a browser. Determining device 1 then performs semantic analysis on the search results corresponding to search records I to VII, for example on their title link text: it counts how often identical or similar text occurs in the title link text of the search results, determines from this the relevance between the search results of search records I to VII, and thereby the association information between them. This yields the following classes of search records: (1) search records I to III are related and form one class; (2) search records IV and V are related and form another class; (3) search records VI and VII are related and form a third class. Then, in step S1, determining device 1 screens out near-synonymy sequence pairs from the sequences of the sequence-result pairs according to these classes, for example by taking the search sequences whose search results belong to the same class as near-synonymy sequence pairs. For search records I to III it obtains pairs1, "Expert English language training by qualified teachers" and "English training", and pairs2, "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers"; for search records IV and V it obtains pairs3, "fresh flower" and "fresh flower express delivery"; for search records VI and VII it obtains pairs4, "dangerous forest thoughts" and "dangerous forest thoughts".
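Among the ways of obtaining near-synonymy sequence pairs described above, the pairing of different search sequences that correspond to the same clicked search result lends itself to a short Python sketch. The click log below is hypothetical: the identifiers `url_weibo` and `url_flowers` and the function name `pairs_from_shared_clicks` are assumptions made for illustration and do not come from the description.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical click log: (search sequence, clicked search result) records.
click_log = [
    ("Expert English language training by qualified teachers", "url_weibo"),
    ("foreign language training", "url_weibo"),
    ("fresh flower", "url_flowers"),
]


def pairs_from_shared_clicks(log):
    """Pair up distinct search sequences that led to a click on the same result."""
    queries_by_result = defaultdict(set)
    for query, result in log:
        queries_by_result[result].add(query)
    pairs = set()
    for queries in queries_by_result.values():
        # Every unordered pair of sequences sharing this result is a candidate.
        for a, b in combinations(sorted(queries), 2):
            pairs.add((a, b))
    return sorted(pairs)


shared_click_pairs = pairs_from_shared_clicks(click_log)
print(shared_click_pairs)
```

A result clicked for only one search sequence, such as `url_flowers` here, produces no pair.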
Those skilled in the art will understand that the above ways of obtaining the plurality of near-synonymy sequence pairs are merely examples; other existing or future ways of obtaining near-synonymy sequence pairs, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
In step S2, determining device 1 determines the initial near-synonymy sequence clusters corresponding to the plurality of near-synonymy sequence pairs according to association information between the pairs, such as the semantic features of the near-synonymy sequences, for example by merging near-synonymy sequences whose semantic features are identical or similar. For example, continuing the example above, in step S2 determining device 1 performs semantic analysis on the sequences of the near-synonymy sequence pairs obtained in step S1 and finds that the sequences of pairs1 and pairs2 are semantically identical or similar. It therefore merges the sequences of pairs1 and pairs2 into the initial near-synonymy sequence cluster cluster1, which comprises "Expert English language training by qualified teachers", "Expert English language training by qualified teachers" and "English training". In the same way, determining device 1 can obtain the initial near-synonymy sequence cluster cluster2, which comprises "fresh flower" and "fresh flower express delivery", and the initial near-synonymy sequence cluster cluster3, which comprises "dangerous forest thoughts" and "dangerous forest thoughts".
Those skilled in the art will understand that the above way of determining the initial near-synonymy sequence clusters is merely an example; other existing or future ways of determining initial near-synonymy sequence clusters, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
In step S3, determining device 1 clusters the sequences in the initial near-synonymy sequence cluster according to their feature vectors, to obtain one or more near-synonymy sequence clusters. Specifically, in step S3, determining device 1 first determines the feature vector of each sequence in the initial near-synonymy sequence cluster, and then clusters the sequences in the initial cluster according to those feature vectors. Here, the feature vector includes, but is not limited to, at least any one of the following feature components:
1. X feature component: a vector formed from the semantic feature information obtained by word segmentation of the sequence, for example the vector formed from the bag of words obtained after segmentation. For the sequence query1 "Expert English language training by qualified teachers", segmentation yields "Expert English language training by qualified teachers", and the corresponding vector can be expressed as {x1: English, x2: training}, where the coefficient of component xi is its TFIDF (term frequency-inverse document frequency) value. As another example, for the sequence query2 "ask way, egg menu, the homely egg of egg how to do, menu complete works", segmentation yields "asking the way egg menu daily life of a family egg of egg how to do the menu complete works"; after stop words, grammatical words and the like are removed, the corresponding vector can be expressed as {x1: egg, x2: way, x3: menu, x4: the daily life of a family, x5: complete works}, where the coefficient of component xi is again its TFIDF value. Taking the word "egg" as an example: the DF value can be approximated statistically over a large collection of web pages (say N pages). If the word "egg" appears in 10000 web pages, its DF value is 10000; if "egg" occurs 3 times in the bag of words obtained after segmentation, its term frequency in that bag, i.e. its TF value, is 3/11, so the TFIDF value of "egg" is (3/11) * log (N/10000).
2. Y feature component: a vector formed from the bag of words obtained by segmenting the titles and/or summary information of the top N search results corresponding to the sequence. Here, the coefficients of the Y feature component may include the historical total click information, the average click information, etc. of the search results corresponding to the sequence. The vector of the Y feature component is determined in the same or a similar way as the vector of the X feature component; for brevity, this is not repeated here and is incorporated by reference.
3. Z feature component: a vector formed from the historical click information of users on the search results corresponding to the sequence. Here, the coefficients of the Z feature component may include the historical total click information, the average click information, etc. of the search results corresponding to the sequence. For example, if the search log records that, for query1, users clicked the search results url11, url12 and url13 of query1 3 times, 4 times and 1 time respectively, then query1 can be represented by the vector {url1, url2, url3}.
Here, the feature vector may be formed in ways that include, but are not limited to, at least any one of the following: 1) directly from the feature components; 2) by weighting the feature components according to their corresponding weight information. Those skilled in the art will understand that the above feature vector and feature components are merely examples; other existing or future feature vectors or feature components, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference. Here, in step S3, the ways in which determining device 1 determines the feature vector of a sequence in the initial near-synonymy sequence cluster include, but are not limited to, at least any one of the following:
1) According to the preset feature components, form the feature vector directly from the feature components; the feature vector can then be expressed as $\vec{T} = \vec{X} + \vec{Y} + \vec{Z}$.
Suppose that the sequence "Expert English language training by qualified teachers" in the initial near-synonymy sequence cluster cluster1 determined in step S2 yields "Expert English language training by qualified teachers" after word segmentation. The feature component $\vec{X}$ can then be expressed as {x1: English, x2: training}; if the TFIDF values of x1 and x2 are 0.9 and 0.9 respectively, then $\vec{X} = 0.9\,\vec{x}_1 + 0.9\,\vec{x}_2$.
For the feature component $\vec{Y}$, suppose that, in the search logs of the last 200 days, the search result of the sequence "Expert English language training by qualified teachers" with the most clicks is url1 "EF Englishtown official website; global distinguished Expert English language training by qualified teachers expert", with for example 10,000 clicks in total. Segmenting its title yields "the distinguished Expert English language training by qualified teachers expert in the whole world, EF Englishtown official website"; after stop words, grammatical words and the like are removed, $\vec{Y}$ can be expressed as {y1: EF, y2: education, y3: English, y4: training, y5: expert}. If the TFIDF values of y1, y2, y3, y4 and y5 are 0.7, 0.77, 0.9, 0.9 and 0.3 respectively, then $\vec{Y} = 0.7\,\vec{y}_1 + 0.77\,\vec{y}_2 + 0.9\,\vec{y}_3 + 0.9\,\vec{y}_4 + 0.3\,\vec{y}_5$.
For the feature component $\vec{Z}$, if in the search logs of the last 200 days the search results url1 "EF Englishtown official website, global distinguished Expert English language training by qualified teachers expert" and url2 "Expert English language training by qualified teachers-Wei Bo English allows study English and becomes so simple!" of the sequence "Expert English language training by qualified teachers" were clicked 4 times and 3 times respectively, then $\vec{Z} = 4\,\vec{url}_1 + 3\,\vec{url}_2$.
Then, in step S3, determining device 1 determines the feature vector of the sequence "Expert English language training by qualified teachers" to be $\vec{T} = (0.9\,\vec{x}_1 + 0.9\,\vec{x}_2) + (0.7\,\vec{y}_1 + 0.77\,\vec{y}_2 + 0.9\,\vec{y}_3 + 0.9\,\vec{y}_4 + 0.3\,\vec{y}_5) + (4\,\vec{url}_1 + 3\,\vec{url}_2)$.
2) According to the preset feature components, determine the feature vector by weighting, based on the weight information corresponding to the feature components. For example, continuing the example above, suppose that the weights of the feature components $\vec{X}$ and $\vec{Y}$ are 0.4 and 0.2 respectively. Then, in step S3, determining device 1 determines the feature vector of the sequence "Expert English language training by qualified teachers" to be $\vec{T} = 0.4\,(0.9\,\vec{x}_1 + 0.9\,\vec{x}_2) + 0.2\,(0.7\,\vec{y}_1 + 0.77\,\vec{y}_2 + 0.9\,\vec{y}_3 + 0.9\,\vec{y}_4 + 0.3\,\vec{y}_5) + (4\,\vec{url}_1 + 3\,\vec{url}_2)$.
Those skilled in the art will understand that the above ways of determining the feature vector of a sequence in the initial near-synonymy sequence cluster are merely examples; other existing or future ways of determining that feature vector, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
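The TFIDF coefficient and the two ways of composing a feature vector described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the description does not give N or the base of the logarithm, so N = 1,000,000 pages and the natural logarithm are assumed here, and the names `tf_idf` and `combine` as well as the sparse dictionary representation are hypothetical.

```python
import math


def tf_idf(term_count, bag_size, doc_freq, num_docs):
    """TFIDF as in the "egg" example: (term_count / bag_size) * log(num_docs / doc_freq)."""
    return (term_count / bag_size) * math.log(num_docs / doc_freq)


# Assumed values: N = 1_000_000 pages, natural logarithm.
egg_weight = tf_idf(term_count=3, bag_size=11, doc_freq=10_000, num_docs=1_000_000)


def combine(components, weights):
    """Compose a sparse feature vector from named feature components.

    A component without an entry in `weights` is taken with weight 1,
    which corresponds to forming the vector directly from the components.
    """
    feature = {}
    for name, component in components.items():
        w = weights.get(name, 1.0)
        for key, value in component.items():
            feature[(name, key)] = w * value
    return feature


# Coefficients from the example for the sequence in cluster1.
components = {
    "X": {"x1": 0.9, "x2": 0.9},
    "Y": {"y1": 0.7, "y2": 0.77, "y3": 0.9, "y4": 0.9, "y5": 0.3},
    "Z": {"url1": 4, "url2": 3},
}
T_plain = combine(components, {})                       # way 1): direct composition
T_weighted = combine(components, {"X": 0.4, "Y": 0.2})  # way 2): weighted composition
```

Keeping the component name in each key keeps the X, Y and Z parts in separate dimensions, so they can later be compared component by component.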
Then, in step S3, determining device 1 clusters the sequences in the initial near-synonymy sequence cluster according to the feature vectors, to obtain one or more near-synonymy sequence clusters. Specifically, in step S3, determining device 1 can first compute the cosine of the angle between corresponding feature components of the feature vectors of the sequences in the initial near-synonymy sequence cluster. It then combines these per-component cosine values, weighted by the weight information of each feature component, into the cosine of the angle between the feature vectors, which gives the similarity of the sequences in the initial cluster. Determining device 1 then clusters the sequences in the initial cluster according to this similarity, to obtain one or more near-synonymy sequence clusters.
For example, suppose that in step S3 the feature vectors of the sequences "Expert English language training by qualified teachers", "Expert English language training by qualified teachers" and "English training" in the initial near-synonymy sequence cluster cluster1 are $\vec{T}_1 = \vec{X}_1 + \vec{Y}_1 + \vec{Z}_1$, $\vec{T}_2 = \vec{X}_2 + \vec{Y}_2 + \vec{Z}_2$ and $\vec{T}_3 = \vec{X}_3 + \vec{Y}_3 + \vec{Z}_3$ respectively. Determining device 1 first computes the cosine of the angle between corresponding feature components of $\vec{T}_1$, $\vec{T}_2$ and $\vec{T}_3$. For $\vec{T}_1$ and $\vec{T}_2$: for the $\vec{X}$ feature component it computes $sim_1 = \cos(\vec{X}_1, \vec{X}_2) = 0.9$; for the $\vec{Y}$ feature component, $sim_2 = \cos(\vec{Y}_1, \vec{Y}_2) = 0.9$; for the $\vec{Z}$ feature component, $sim_3 = \cos(\vec{Z}_1, \vec{Z}_2) = 0.6$. Determining device 1 can then obtain the similarity between $\vec{T}_1$ and $\vec{T}_2$ as $similarity(\vec{T}_1, \vec{T}_2) = a \cdot sim_1 + b \cdot sim_2 + c \cdot sim_3$, where a, b and c are the weights of the corresponding feature components and satisfy a+b+c=1. The values of a, b and c can be determined by machine learning or can be predetermined values.
If a=0.5, b=0.3 and c=0.2, determining device 1 computes the similarity between $\vec{T}_1$ and $\vec{T}_2$ as $similarity(\vec{T}_1, \vec{T}_2) = a \cdot sim_1 + b \cdot sim_2 + c \cdot sim_3 = 0.5 \times 0.9 + 0.3 \times 0.9 + 0.2 \times 0.6 = 0.84$, which is greater than the predetermined threshold, for example 0.8. Determining device 1 therefore assigns the sequences corresponding to $\vec{T}_1$ and $\vec{T}_2$, "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers", to the same near-synonymy sequence cluster, for example synonyms-cluster1. Similarly, in step S3, determining device 1 can compute the similarity between $\vec{T}_3$ and the feature vectors in synonyms-cluster1 (for example $\vec{T}_1$) and obtain a value less than the predetermined threshold, for example 0.8; determining device 1 then assigns the sequence "English training" corresponding to $\vec{T}_3$ to another near-synonymy sequence cluster, for example synonyms-cluster2.
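The weighted combination of per-component cosine values used in this example can be checked with a few lines of Python. The sketch assumes sparse vectors stored as dictionaries; the function names `cosine` and `weighted_similarity` are hypothetical, while the cosine values, the weights and the threshold 0.8 are those of the example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an empty component
    return dot / (norm_u * norm_v)


def weighted_similarity(component_sims, weights):
    """a*sim1 + b*sim2 + c*sim3, with weights that sum to 1."""
    assert math.isclose(sum(weights), 1.0)
    return sum(w * s for w, s in zip(weights, component_sims))


# Per-component cosine values and weights from the example.
sim = weighted_similarity([0.9, 0.9, 0.6], [0.5, 0.3, 0.2])
same_cluster = sim > 0.8  # predetermined threshold from the example
print(round(sim, 2), same_cluster)
```

In practice the per-component values would come from `cosine` applied to the X, Y and Z components of two sequences; they are entered directly here because the description gives only the resulting cosine values.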
Those skilled in the art will understand that the above way of determining the weight information of each feature component is merely an example; other existing or future ways of determining the weight information of each feature component, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
Preferably, in step S3, determining device 1 can first determine similarity information between the sequences in the initial near-synonymy sequence cluster according to their feature vectors, and then cluster the sequences in the initial cluster according to that similarity information, to obtain one or more near-synonymy sequence clusters; wherein the feature vector comprises at least any one of the following feature components:
- semantic feature information of the sequence;
- historical click information of the search results corresponding to the sequence;
- summary information of the search results corresponding to the sequence.
For example, suppose that in step S3 the feature vectors of the sequences "Expert English language training by qualified teachers", "Expert English language training by qualified teachers" and "English training" in the initial near-synonymy sequence cluster cluster1 are $\vec{T}_1 = \vec{X}_1 + \vec{Y}_1$, $\vec{T}_2 = \vec{X}_2 + \vec{Y}_2$ and $\vec{T}_3 = \vec{X}_3 + \vec{Y}_3$ respectively. Determining device 1 first computes the cosine of the angle between corresponding feature components of $\vec{T}_1$, $\vec{T}_2$ and $\vec{T}_3$. For $\vec{T}_1$ and $\vec{T}_2$: for the $\vec{X}$ feature component it computes $sim_1' = \cos(\vec{X}_1, \vec{X}_2) = 1$; for the $\vec{Y}$ feature component, $sim_2' = \cos(\vec{Y}_1, \vec{Y}_2) = 1$. From these cosine values determining device 1 can obtain the similarity between $\vec{T}_1$ and $\vec{T}_2$ as $similarity(\vec{T}_1, \vec{T}_2) = a \cdot sim_1' + b \cdot sim_2'$. If a=0.5 and b=0.5, the similarity is $similarity(\vec{T}_1, \vec{T}_2) = 0.5 \times 1 + 0.5 \times 1 = 1$, which is greater than the predetermined threshold, for example 0.8. Determining device 1 therefore determines that the sequences corresponding to $\vec{T}_1$ and $\vec{T}_2$, "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers", belong to the same near-synonymy sequence cluster, for example synonyms-cluster1. Then, when determining whether $\vec{T}_3$ belongs to the near-synonymy sequence cluster synonyms-cluster1, determining device 1 similarly computes the cosine values between the feature components of $\vec{T}_3$ and those of the feature vectors in synonyms-cluster1 (for example $\vec{T}_1$) to determine their similarity. If the similarity obtained is less than the predetermined threshold, for example 0.8, determining device 1 assigns the sequence "English training" corresponding to $\vec{T}_3$ to another near-synonymy sequence cluster, for example synonyms-cluster2.
Those skilled in the art will understand that the above way of determining the similarity information between the sequences in the initial near-synonymy sequence cluster is merely an example; other existing or future ways of determining that similarity information, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
Those skilled in the art will understand that the above way of obtaining the near-synonymy sequence clusters is merely an example; other existing or future ways of obtaining near-synonymy sequence clusters, if applicable to the present invention, also fall within the scope of protection of the present invention and are incorporated herein by reference.
The steps of determining device 1 operate continuously. Specifically, in step S1, determining device 1 continuously obtains near-synonymy sequence pairs; in step S2, determining device 1 continuously determines the initial near-synonymy sequence clusters corresponding to those pairs; in step S3, determining device 1 continuously clusters the sequences in the initial near-synonymy sequence clusters according to their feature vectors, to obtain one or more near-synonymy sequence clusters. Here, those skilled in the art should understand that "continuously" means that the steps of determining device 1 keep carrying out, respectively, the obtaining of near-synonymy sequence pairs, the determination of initial near-synonymy sequence clusters and the obtaining of near-synonymy sequence clusters, until determining device 1 stops obtaining related query sequences and their corresponding search results for an extended period of time.
Preferably, step S2 comprises step S21 (not shown) and step S22 (not shown). Specifically, in step S21, determining device 1 determines, based on a label propagation algorithm and according to the labels of the sequences in the near-synonymy sequence pairs, the dense sequence clusters corresponding to the pairs; in step S22, determining device 1 merges the sequences of the near-synonymy sequence pairs according to the dense sequence clusters, to obtain the initial near-synonymy sequence clusters.
Specifically, in step S21, determining device 1 first takes the sequences of the near-synonymy sequence pairs obtained in step S1 as vertices and assigns each vertex a unique label, so that each sequence of the pairs has one unique label; it takes the near-synonymy relations between the sequences of the pairs as edges, which together form a sequence connection graph. Then, in step S21, determining device 1 determines the label of each vertex based on the label propagation algorithm, for example as the label that occurs most frequently among its adjacent vertices, and iterates; finally, the vertices with the same label are gathered into a cluster, which gives the dense sequence clusters corresponding to the near-synonymy sequence pairs. For example, suppose that the near-synonymy sequence pairs obtained in step S1 comprise query1-query2, query1-query3, query1-query4, query2-query4, query5-query6, query6-query7 and query5-query8, and that determining device 1 assigns the unique labels shown in Table 2 below to the sequences of these pairs:
Sequence    Corresponding label    Sequence    Corresponding label
query1      A1                     query5      E1
query2      B1                     query6      F1
query3      C1                     query7      G1
query4      D1                     query8      H1
Table 2
The sequence connection graph that the determining equipment 1 obtains in step S21 for these near-synonymy sequence pairs is shown in Fig. 2, where a solid line indicates that a pair of sequences has a near-synonymy relation and a dashed line indicates that it does not. Then, in step S21, the determining equipment 1 determines the label of each vertex shown in Fig. 2 based on the label propagation algorithm, for example as the label that occurs most frequently among the vertex's adjacent vertices. Taking vertex A1 as an example, the labels of its adjacent nodes are B1, C1 and D1; supposing that B1, C1 and D1 each occur with frequency 1, the determining equipment 1 determines in step S21 that the label of vertex A1 is the adjacent label with the largest label number, e.g. D1, or the adjacent label with the smallest label number, e.g. B1. Similarly, in step S21, the determining equipment 1 determines in turn that the labels corresponding to vertices B1, C1, D1, E1, F1, G1 and H1 are A1, A1, A1, F1, G1, F1 and E1 respectively, obtaining the new sequence connection graph corresponding to Fig. 2, as shown in Fig. 3. Then, in step S21, the determining equipment 1 gathers the vertices with the same label into a cluster to obtain the intensive sequence clusters corresponding to the plurality of near-synonymy sequence pairs. For example, the new label of initial labels B1, C1 and D1 is A1, and the new label of initial labels E1 and G1 is F1; in step S21 the determining equipment 1 therefore gathers the vertices corresponding to initial labels B1, C1 and D1 into a cluster, obtaining an intensive sequence cluster such as intensive-cluster1, which comprises the sequences query2, query3 and query4 corresponding to initial labels B1, C1 and D1. Likewise, in step S21, the determining equipment 1 gathers the vertices corresponding to initial labels E1 and G1 into a cluster, obtaining an intensive sequence cluster such as intensive-cluster2, which comprises the sequences query5 and query7 corresponding to initial labels E1 and G1.
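The label update described above can be sketched in Python as follows. This is a minimal illustration rather than the claimed implementation: the function names `propagate_once` and `group_by_label` are hypothetical, only a single propagation round is performed, and ties between equally frequent adjacent labels are broken by taking the smallest label, which is one of the two options mentioned in the text. A vertex with tied neighbours may therefore receive a different label from the one listed in the worked example.

```python
from collections import Counter, defaultdict

def propagate_once(pairs, labels):
    """Run one synchronous round of label propagation over the pair graph."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    new_labels = {}
    for vertex, adjacent in neighbours.items():
        counts = Counter(labels[n] for n in adjacent)
        top = max(counts.values())
        # Assumption: break ties by choosing the smallest label.
        new_labels[vertex] = min(l for l, c in counts.items() if c == top)
    return new_labels

def group_by_label(labels):
    """Gather the vertices that share a label into one cluster."""
    clusters = defaultdict(set)
    for vertex, label in labels.items():
        clusters[label].add(vertex)
    return dict(clusters)

# Pairs and initial labels taken from the example and Table 2.
pairs = [("query1", "query2"), ("query1", "query3"), ("query1", "query4"),
         ("query2", "query4"), ("query5", "query6"), ("query6", "query7"),
         ("query5", "query8")]
initial = {"query1": "A1", "query2": "B1", "query3": "C1", "query4": "D1",
           "query5": "E1", "query6": "F1", "query7": "G1", "query8": "H1"}

clusters = group_by_label(propagate_once(pairs, initial))
```

With this tie-break, the vertices relabelled A1 are query2, query3 and query4, and those relabelled F1 are query5 and query7, matching intensive-cluster1 and intensive-cluster2 in the example.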
Those skilled in the art will understand that the above manner of determining intensive sequence clusters is merely an example; other existing or future manners of determining intensive sequence clusters, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
Then, in step S22, the determining equipment 1 performs sequence merge processing on the sequences in the plurality of near-synonymy sequence pairs according to the intensive sequence clusters, to obtain the initial near-synonymy sequence clusters. Specifically, in step S22, the determining equipment 1 first treats each intensive sequence cluster as one vertex and determines, for each node corresponding to a sequence in the plurality of near-synonymy sequence pairs, its node to be merged. For example, among all nodes connected to that node, the node with the highest degree is taken as the node to be merged; or, when several of the connected nodes share the same highest degree, one of them is selected at random as the node to be merged. Here, the degree of a node is the number of nodes connected to it. Then, the sequence merging unit performs sequence merge processing on the sequences in the plurality of near-synonymy sequence pairs based on predetermined merging rules, to obtain the initial near-synonymy sequence clusters. Here, the predetermined merging rules include, but are not limited to, at least any one of the following: 1) nodes that have the same node to be merged are merged; 2) nodes that are each other's node to be merged are merged.
For example, continuing the above example, in step S22 the determining equipment 1 treats each intensive sequence cluster determined in step S21 as one vertex and obtains the node merge diagram, corresponding to Fig. 3, in which the intensive sequence clusters are treated as vertices, as shown in Fig. 4. There, the initial labels B1, C1 and D1 comprised by intensive sequence cluster intensive-cluster1 serve as one vertex whose label is A1, and the initial labels E1 and G1 comprised by intensive sequence cluster intensive-cluster2 serve as one vertex whose label is F1. In Fig. 4, the nodes {E1, G1} have the same node to be merged, and the node pair {A1, B1} are each other's node to be merged. In step S22 the determining equipment 1 therefore merges the nodes {E1, G1}, merges the node pair {A1, B1}, treats the merged sets {E1, G1, F1} and {A1, B1} as nodes, rebuilds the sequence connection graph, and continues node merge processing until no edge remains between any two nodes. Finally, in step S22, the determining equipment 1 classifies the sequences corresponding to the set {E1, G1, F1} into one near-synonymy sequence cluster and the sequences corresponding to the set {A1, B1} into another near-synonymy sequence cluster.
As another example, suppose that in step S22 the node merge diagram obtained by the determining equipment 1 after treating the intensive sequence clusters determined in step S21 as vertices is as shown in Fig. 5, in which the nodes {A, B, D} have the same node to be merged and the node pair {E, F} are each other's node to be merged. In step S22 the determining equipment 1 then merges the nodes {A, B, D}, merges the node pair {E, F}, treats the merged sets {A, B, D, C} and {E, F} as nodes, rebuilds the sequence connection graph, and continues node merging until no edge remains between any two nodes. Finally, in step S22, the determining equipment 1 classifies the sequences corresponding to the set {A, B, D, C} into one near-synonymy sequence cluster and the sequences corresponding to the set {E, F} into another near-synonymy sequence cluster, as shown in Fig. 6.
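The merge procedure of step S22 can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the patented implementation: the function name `merge_until_no_edges` is hypothetical, the edge list is a hypothetical graph chosen to be consistent with the second example (Fig. 5 itself is not reproduced here), a node is merged together with the node it targets as in the merged sets of the examples, and ties between nodes of equal degree are broken deterministically by member names instead of randomly so that the result is reproducible.

```python
from collections import defaultdict

def merge_until_no_edges(edges):
    """Apply both merging rules repeatedly until no edge remains."""
    # Every node is a frozenset of sequence names; start from singletons.
    nodes = {frozenset([v]) for edge in edges for v in edge}
    links = {frozenset([frozenset([a]), frozenset([b])]) for a, b in edges}
    while links:
        adjacent = defaultdict(set)
        for link in links:
            u, v = tuple(link)
            adjacent[u].add(v)
            adjacent[v].add(u)

        def rank(node):
            # Highest degree first; names break ties (assumption).
            return (len(adjacent[node]), sorted(node))

        target = {u: max(adjacent[u], key=rank) for u in list(adjacent)}
        parent = {n: n for n in nodes}

        def find(n):
            while parent[n] != n:
                n = parent[n]
            return n

        by_target = defaultdict(list)
        for u, t in target.items():
            by_target[t].append(u)
            if target[t] == u:                # rule 2: mutual targets
                parent[find(u)] = find(t)
        for group in by_target.values():      # rule 1: same target
            for other in group[1:]:
                parent[find(other)] = find(group[0])

        # Contract merged nodes and rebuild the connection graph.
        members = defaultdict(set)
        for n in nodes:
            members[find(n)] |= n
        merged = {n: frozenset(members[find(n)]) for n in nodes}
        nodes = set(merged.values())
        links = {frozenset([merged[u], merged[v]])
                 for u, v in map(tuple, links) if merged[u] != merged[v]}
    return nodes

# Hypothetical edges: A, B and D all connect to C; E connects to F.
result = merge_until_no_edges([("A", "C"), ("B", "C"), ("D", "C"), ("E", "F")])
```

For this hypothetical graph the sketch ends with the two sets {A, B, C, D} and {E, F}. Because merging continues until no edge is left, the final sets coincide with the connected components of the input graph.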
Those skilled in the art will understand that the above manner of performing sequence merge processing on the sequences in the plurality of near-synonymy sequence pairs is merely an example; other existing or future manners of performing sequence merge processing on the sequences in the plurality of near-synonymy sequence pairs, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
In a preferred embodiment (referring to Fig. 8), the method carried out by the determining equipment 1 comprises step S1, step S2 and step S3, wherein step S3 comprises step S31 (not shown) and step S32 (not shown). The preferred embodiment is described below with reference to Fig. 8. Specifically, in step S1, the determining equipment 1 obtains a plurality of near-synonymy sequence pairs; in step S2, the determining equipment 1 determines the initial near-synonymy sequence clusters corresponding to the plurality of near-synonymy sequence pairs; in step S31, the determining equipment 1 clusters the sequences in the initial near-synonymy sequence clusters according to the feature vectors of those sequences, to obtain one or more candidate near-synonymy sequence clusters; in step S32, the determining equipment 1 denoises the candidate near-synonymy sequence clusters to obtain the near-synonymy sequence clusters. Here, steps S1 and S2 are the same as or similar to the corresponding steps in the embodiment shown in Fig. 8; for brevity, they are not described again here and are incorporated herein by reference.
Specifically, in step S31, the determining equipment 1 clusters the sequences in the initial near-synonymy sequence clusters according to the feature vectors of those sequences, to obtain one or more candidate near-synonymy sequence clusters. Here, the manner in which the determining equipment 1 obtains the one or more candidate near-synonymy sequence clusters in step S31 is the same as or similar to the manner in which the determining equipment 1 obtains the one or more near-synonymy sequence clusters in step S3 of Fig. 8; for brevity, it is not described again here and is incorporated herein by reference.
Then, in step S32, the determining equipment 1 denoises the candidate near-synonymy sequence clusters, for example by removing redundant text, to obtain the near-synonymy sequence clusters. For example, suppose that the candidate near-synonymy sequence cluster candidate-cluster obtained by the determining equipment 1 in step S31 comprises several sequences such as queryA: "asking the way of egg; the egg menu; how homely egg is done; the menu complete works", queryB: "homely egg way" and queryC: "how doing homely egg dish". In step S32, the determining equipment 1 performs semantic analysis on the description texts corresponding to the sequences queryA, queryB and queryC in this candidate near-synonymy sequence cluster and finds that the theme of the sequences in the cluster is how to cook eggs, whereas the text "menu complete works" contained in the description text of queryA deviates from this theme. In step S32 the determining equipment 1 therefore judges the text "menu complete works" to be redundant text and removes it from the description text of queryA, obtaining the near-synonymy sequence cluster corresponding to this candidate near-synonymy sequence cluster, which comprises the sequences queryA: "asking the way of egg; the egg menu; how homely egg is done", queryB: "homely egg way" and queryC: "how doing homely egg dish".
Preferably, in step S32, the determining equipment 1 may also denoise the candidate near-synonymy sequence cluster according to similarity information between the feature vectors of the sequences in the candidate near-synonymy sequence cluster and the cluster feature vector corresponding to that candidate near-synonymy sequence cluster, to obtain the near-synonymy sequence cluster. Specifically, in step S32, the determining equipment 1 first determines the feature vectors of the sequences in the candidate near-synonymy sequence cluster and the cluster feature vector corresponding to that cluster. It then determines the similarity information between the feature vectors of the sequences and the cluster feature vector: when the feature vectors of the sequences and the cluster feature vector include feature components whose vector coefficients are textual descriptions, the similarity information is determined according to the degree of text matching between the vector coefficients of the sequence feature vectors and those of the cluster feature vector; when they do not include such feature components, the similarity information is determined according to the angle between the feature vector of each sequence and the cluster feature vector. Then, according to this similarity information, the candidate near-synonymy sequence cluster is denoised, for example by deleting from it the sequences whose similarity information is less than a predetermined threshold, to obtain the near-synonymy sequence cluster.
Specifically, in step S32, the determining equipment 1 first determines the feature vectors of the sequences in the candidate near-synonymy sequence cluster and the cluster feature vector corresponding to that cluster. To do so, in step S32, the determining equipment 1 first determines the feature vectors of the sequences in the candidate near-synonymy sequence cluster; then, in step S32, the determining equipment 1 determines the cluster feature vector corresponding to the candidate near-synonymy sequence cluster from those feature vectors, for example by taking the mean of the vector coefficients of each feature component over the feature vectors of the sequences as the vector coefficient of the corresponding feature component of the cluster feature vector. Here, the manner in which the determining equipment 1 determines the feature vectors of the sequences in the candidate near-synonymy sequence cluster in step S32 is the same as or similar to the manner in which the determining equipment 1 determines the feature vectors of the sequences in the initial near-synonymy sequence clusters in step S3 of Fig. 8; for brevity, it is not described again here and is incorporated herein by reference.
Those skilled in the art will understand that the above manner of determining the cluster feature vector is merely an example; other existing or future manners of determining the cluster feature vector, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
Then, in step S32, the determining equipment 1 determines the similarity information between the feature vectors of the sequences in the candidate near-synonymy sequence cluster and the cluster feature vector corresponding to that cluster. Here, the manner in which the determining equipment 1 determines this similarity information in step S32 is the same as or similar to the manner in which the determining equipment 1 determines the similarity between the feature vectors of the sequences in the initial near-synonymy sequence clusters in step S3 of Fig. 8; for brevity, it is not described again here and is incorporated herein by reference.
Then, in step S32, the determining equipment 1 denoises the candidate near-synonymy sequence cluster according to this similarity information, for example by deleting from the candidate near-synonymy sequence cluster the sequences whose similarity information is less than a predetermined threshold, to obtain the near-synonymy sequence cluster. For example, in step S32, the determining equipment 1 determines that the similarity information between the feature vector of sequence queryA in the candidate near-synonymy sequence cluster candidate-cluster and the cluster feature vector corresponding to candidate-cluster is 0.86, which is less than the predetermined threshold 0.9; in step S32 the determining equipment 1 therefore deletes sequence queryA from the candidate near-synonymy sequence cluster candidate-cluster, obtaining the near-synonymy sequence cluster.
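The similarity-based denoising of step S32 can be sketched as follows for the case of purely numeric feature vectors. The sketch rests on assumptions that are not in the source: cosine similarity is used as the measure of the angle between a sequence vector and the cluster feature vector, the function names are hypothetical, and the two-dimensional vectors are invented illustration data (so the similarity of queryA here differs from the 0.86 of the example).

```python
import math

def cluster_vector(vectors):
    """Component-wise mean of the sequence feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def denoise(cluster, threshold=0.9):
    """Keep only sequences whose similarity to the cluster vector reaches the threshold."""
    centre = cluster_vector(list(cluster.values()))
    return {name: vec for name, vec in cluster.items()
            if cosine(vec, centre) >= threshold}

# Invented feature vectors for the three sequences of the example.
candidate_cluster = {
    "queryA": [0.2, 0.5],
    "queryB": [1.0, 0.1],
    "queryC": [1.0, 0.2],
}
kept = denoise(candidate_cluster, threshold=0.9)
```

With these invented vectors, queryA falls below the 0.9 threshold and is deleted, while queryB and queryC are retained.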
Those skilled in the art will understand that the above manner of denoising the candidate near-synonymy sequence cluster is merely an example; other existing or future manners of denoising the candidate near-synonymy sequence cluster, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
Fig. 9 shows a flowchart of a method for determining near-synonymy sequence clusters in accordance with a preferred embodiment of the present invention.
Specifically, in step S1', the determining equipment 1 obtains a plurality of near-synonymy sequence pairs; in step S2', the determining equipment 1 determines the initial near-synonymy sequence clusters corresponding to the plurality of near-synonymy sequence pairs; in step S3', the determining equipment 1 clusters the sequences in the initial near-synonymy sequence clusters according to the feature vectors of those sequences, to obtain one or more near-synonymy sequence clusters; in step S4', the determining equipment 1 establishes or updates a near-synonymy sequence library according to the near-synonymy sequence clusters. Here, steps S1', S2' and S3' are the same as or similar to the corresponding steps in the embodiment shown in Fig. 8; for brevity, they are not described again here and are incorporated herein by reference.
Specifically, in step S4', the determining equipment 1 establishes or updates the near-synonymy sequence library according to the near-synonymy sequence clusters. For example, suppose that a near-synonymy sequence cluster obtained by the determining equipment 1 in step S3', such as synonyms-cluster1, comprises the sequences "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers"; in step S4' the determining equipment 1 then stores the sequences "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers" comprised by this near-synonymy sequence cluster synonyms-cluster1 in the near-synonymy sequence library and updates the library in a certain manner, for example periodically according to a predetermined period, or immediately.
Those skilled in the art will understand that the above manner of updating the near-synonymy sequence library is merely an example; other existing or future manners of updating the near-synonymy sequence library, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
Preferably, the method carried out by the determining equipment 1 further comprises step S5' (not shown) and step S6' (not shown). Specifically, in step S5', the determining equipment 1 detects whether a sequence in the near-synonymy sequence cluster is present in other near-synonymy sequence clusters in the near-synonymy sequence library; if it is, in step S6', the determining equipment 1 performs de-redundancy processing on this sequence to update the near-synonymy sequence library.
Specifically, in step S5', the determining equipment 1 detects whether a sequence in the near-synonymy sequence cluster is present in other near-synonymy sequence clusters in the near-synonymy sequence library. For example, suppose that a near-synonymy sequence cluster obtained by the determining equipment 1 in step S3', such as synonyms-cluster1, comprises the sequences "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers"; in step S5' the determining equipment 1 then compares the sequences comprised by the near-synonymy sequence cluster synonyms-cluster1 obtained in step S3' with the sequences comprised by the other near-synonymy sequence clusters in the near-synonymy sequence library established in step S4', and determines that the sequences "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers" comprised by synonyms-cluster1 are not present in the other near-synonymy sequence clusters in the near-synonymy sequence library.
If it is, in step S6', the determining equipment 1 performs de-redundancy processing on this sequence to update the near-synonymy sequence library. For example, continuing the above example, suppose that in step S5' the determining equipment 1 finds that another near-synonymy sequence cluster in the near-synonymy sequence library, such as synonyms-cluster1', also comprises the sequence "Expert English language training by qualified teachers" that is present in the near-synonymy sequence cluster synonyms-cluster1. In step S6' the determining equipment 1 then performs de-redundancy processing on this sequence "Expert English language training by qualified teachers": for example, according to the degree of correlation between this repeated sequence and each near-synonymy sequence cluster in which it occurs, the sequence is retained in the near-synonymy sequence cluster with the highest degree of correlation and its occurrences in the other near-synonymy sequence clusters are deleted, so that it is present in only one sequence cluster, thereby updating the near-synonymy sequence library. The present invention thus ensures that the updated near-synonymy sequence library contains no sequence that belongs to more than one sequence cluster, which improves the accuracy of the near-synonymy sequence clusters.
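Steps S5' and S6' can be sketched as follows. This is an illustration under assumptions: the library is modelled as a plain dictionary from cluster identifier to a set of sequences, the function name `remove_redundancy` is hypothetical, the text does not specify how the degree of correlation is computed so it is passed in as a callable, and the sequence names and scores are invented.

```python
from collections import defaultdict

def remove_redundancy(library, correlation):
    """Keep each repeated sequence only in its most correlated cluster.

    library: dict mapping cluster id -> set of sequences (modified in place).
    correlation: callable (sequence, cluster id) -> float, supplied by the caller.
    """
    owners = defaultdict(list)
    for cluster_id, sequences in library.items():
        for sequence in sequences:
            owners[sequence].append(cluster_id)
    for sequence, cluster_ids in owners.items():
        if len(cluster_ids) > 1:          # step S5': present in other clusters
            best = max(cluster_ids, key=lambda cid: correlation(sequence, cid))
            for cid in cluster_ids:       # step S6': delete the other occurrences
                if cid != best:
                    library[cid].discard(sequence)
    return library

# Invented library and correlation scores for illustration only.
library = {
    "synonyms-cluster1": {"seq-a", "seq-b"},
    "synonyms-cluster1'": {"seq-a", "seq-c"},
}
scores = {("seq-a", "synonyms-cluster1"): 0.9,
          ("seq-a", "synonyms-cluster1'"): 0.4}
remove_redundancy(library, lambda seq, cid: scores.get((seq, cid), 0.0))
```

After the call, "seq-a" remains only in synonyms-cluster1, the cluster with which it has the higher invented correlation score.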
Those skilled in the art will understand that the above manner of performing de-redundancy processing on a sequence to update the near-synonymy sequence library is merely an example; other existing or future manners of performing de-redundancy processing on a sequence to update the near-synonymy sequence library, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
Preferably, the method carried out by the determining equipment 1 further comprises step S7' (not shown), step S8' (not shown) and step S9' (not shown). Specifically, in step S7', the determining equipment 1 obtains a query sequence input by a user; in step S8', the determining equipment 1 performs a matching query in the near-synonymy sequence library according to the query sequence, to determine the target near-synonymy sequence cluster corresponding to the query sequence; in step S9', the determining equipment 1 provides at least one sequence in the target near-synonymy sequence cluster to the user as a recommended item for the query sequence.
Specifically, in step S7', the determining equipment 1 obtains the query sequence input by the user through dynamic web page technologies such as ASP or JSP, or through an application programming interface (API) provided by a search engine. For example, if search user A enters the keyword "Expert English language training by qualified teachers" in the search box of a search engine on a mobile device such as an iPhone and presses the Enter key, then in step S7' the determining equipment 1 obtains, through dynamic web page technologies such as ASP or JSP, the query sequence "Expert English language training by qualified teachers" that user A input on that mobile device.
In step S8', the determining equipment 1 performs a matching query in the near-synonymy sequence library according to the query sequence, to determine the target near-synonymy sequence cluster corresponding to the query sequence. For example, continuing the above example, in step S8' the determining equipment 1 performs a matching query, according to the query sequence obtained in step S7', in the near-synonymy sequence library established or updated in step S4', and obtains the target near-synonymy sequence cluster corresponding to the query sequence, for example by taking the near-synonymy sequence cluster that contains the query sequence "Expert English language training by qualified teachers" as the target near-synonymy sequence cluster; the near-synonymy sequence cluster that contains this query sequence comprises, for example, the near-synonymy sequences "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers".
In step S9', the determining equipment 1 provides at least one sequence in the target near-synonymy sequence cluster to the user as a recommended item for the query sequence, through dynamic web page technologies such as ASP, JSP or PHP, or through other agreed communication modes such as the http or https communication protocols. For example, continuing the above example, in step S9' the determining equipment 1 provides the near-synonymy sequences "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers", which are comprised by the near-synonymy sequence cluster that contains the query sequence "Expert English language training by qualified teachers", to user A as recommended items for the user to browse and select. For instance, when user A enters the sequence "Expert English language training by qualified teachers" in the search box, in step S9' the determining equipment 1 presents at least one sequence in the target near-synonymy sequence cluster corresponding to this sequence, as determined by the first query unit, to user A in the form of a drop-down box as a recommended item for this sequence.
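Steps S8' and S9' can be sketched as a lookup followed by a recommendation. The sketch makes several assumptions that are not in the source: the library is a dictionary from cluster identifier to a set of sequences, matching is done by exact string equality, the query itself is left out of the recommended items, and the function names, the `limit` parameter and the example strings are all hypothetical.

```python
def build_index(library):
    """Map every sequence to the identifier of the cluster that contains it."""
    return {seq: cid for cid, seqs in library.items() for seq in seqs}

def recommend(query, library, index, limit=5):
    """Return up to `limit` other sequences from the query's target cluster."""
    cluster_id = index.get(query)         # step S8': matching query (exact match assumed)
    if cluster_id is None:
        return []
    # Step S9': offer other members of the target cluster as recommended items.
    return sorted(s for s in library[cluster_id] if s != query)[:limit]

# Invented library contents for illustration only.
library = {
    "synonyms-cluster1": {"english training", "english training course",
                          "english classes"},
}
index = build_index(library)
suggestions = recommend("english training", library, index)
```

A production system would more likely normalise the query or use fuzzy matching before the lookup; exact matching is used here only to keep the sketch short.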
In a preferred embodiment (referring to Fig. 9), the method carried out by the determining equipment 1 further comprises step S10' (not shown), step S11' (not shown) and step S12' (not shown). The preferred embodiment is described below with reference to Fig. 9. Specifically, in step S1', the determining equipment 1 obtains a plurality of near-synonymy sequence pairs; in step S2', the determining equipment 1 determines the initial near-synonymy sequence clusters corresponding to the plurality of near-synonymy sequence pairs; in step S3', the determining equipment 1 clusters the sequences in the initial near-synonymy sequence clusters according to the feature vectors of those sequences, to obtain one or more near-synonymy sequence clusters; in step S4', the determining equipment 1 may also establish or update the near-synonymy sequence library according to the near-synonymy sequence clusters and their corresponding groups of preferred search results, wherein each near-synonymy sequence cluster corresponds to one group of preferred search results; in step S10', the determining equipment 1 obtains a query sequence input by a user; in step S11', the determining equipment 1 performs a matching query in the near-synonymy sequence library according to the query sequence, to obtain the target near-synonymy sequence cluster corresponding to the query sequence; in step S12', the determining equipment 1 provides at least one of the group of preferred search results corresponding to the target near-synonymy sequence cluster to the user. Here, steps S1', S2' and S3' are the same as or similar to the corresponding steps in the embodiment shown in Fig. 8; for brevity, they are not described again here and are incorporated herein by reference.
Specifically, in step S4', the determining equipment 1 first, according to the near-synonymy sequence cluster obtained in step S3', counts the search results recorded in the search log that users clicked and that match the sequences in this near-synonymy sequence cluster, and takes the search results whose number of occurrences is greater than a certain threshold as the group of preferred search results. Then, in step S4', the determining equipment 1 establishes or updates the near-synonymy sequence library according to the near-synonymy sequence cluster and its corresponding group of preferred search results, wherein each near-synonymy sequence cluster corresponds to one group of preferred search results. Here, the group of preferred search results comprises high-quality, highly authoritative search results that match the near-synonymy sequence cluster and that genuinely meet the user's search needs. It can be derived by statistical analysis of users' search and browsing behaviour, for example by taking the search results whose pages users browse for a long time, or the search results that users click and visit many times, as the preferred search results.
For example, suppose that the near-synonymy sequence cluster synonyms-cluster1 obtained by the determining equipment 1 in step S3' comprises the sequences "Expert English language training by qualified teachers" and "Expert English language training by qualified teachers". In step S4' the determining equipment 1 then, according to these sequences comprised by synonyms-cluster1, counts the search results recorded in the search log that users clicked and that match the sequences in this near-synonymy sequence cluster, and takes, for example, the search results whose number of occurrences is greater than a certain threshold, such as 2, as the preferred search results. In step S4' the determining equipment 1 can thus find from the search log that the preferred search results corresponding to the near-synonymy sequence cluster synonyms-cluster1 comprise, for example, "EF Englishtown official website, global distinguished Expert English language training by qualified teachers expert". Then, in step S4', the determining equipment 1 stores the near-synonymy sequence cluster synonyms-cluster1 and its corresponding group of preferred search results in the near-synonymy sequence library and updates the library in a certain manner, for example periodically according to a predetermined period, or immediately.
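The selection of preferred search results from the search log can be sketched as a simple count with a threshold. The log format (a list of query and clicked-result pairs), the function name `preferred_results` and the example entries are assumptions made for illustration; the threshold of 2 follows the example in the text.

```python
from collections import Counter

def preferred_results(cluster, click_log, threshold=2):
    """Return clicked results that occur more than `threshold` times for the cluster.

    cluster: set of sequences in one near-synonymy sequence cluster.
    click_log: iterable of (query sequence, clicked result) pairs (assumed format).
    """
    counts = Counter(result for query, result in click_log if query in cluster)
    # most_common() orders results from most to least frequently clicked.
    return [result for result, count in counts.most_common() if count > threshold]

# Invented click log for illustration only.
cluster = {"seq-a", "seq-b"}
click_log = [
    ("seq-a", "result-1"), ("seq-b", "result-1"), ("seq-a", "result-1"),
    ("seq-a", "result-2"), ("seq-b", "result-2"),
    ("seq-c", "result-1"),   # query outside the cluster, ignored
]
preferred = preferred_results(cluster, click_log, threshold=2)
```

Here "result-1" is clicked three times for sequences of the cluster and is kept, while "result-2" is clicked only twice and does not exceed the threshold.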
Those skilled in the art will understand that the above manner of determining the group of preferred search results corresponding to the near-synonymy sequence cluster is merely an example; other existing or future manners of determining the group of preferred search results corresponding to the near-synonymy sequence cluster, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
In step S10', the determining equipment 1 obtains the query sequence input by the user. Here, the manner in which the determining equipment 1 obtains the query sequence input by the user in step S10' is the same as or similar to the manner in which it obtains the query sequence input by the user in step S7'; for brevity, it is not described again here and is incorporated herein by reference.
In step S11', the determining equipment 1 performs a matching query in the near-synonymy sequence library according to the query sequence, to obtain the target near-synonymy sequence cluster corresponding to the query sequence. Here, the manner in which the determining equipment 1 obtains the target near-synonymy sequence cluster corresponding to the query sequence in step S11' is the same as or similar to the manner in which it obtains the target near-synonymy sequence cluster corresponding to the query sequence in step S8'; for brevity, it is not described again here and is incorporated herein by reference.
In step S12', the determining equipment 1 provides at least one of the group of preferred search results corresponding to the target near-synonymy sequence cluster, such as "EF Englishtown official website, global distinguished Expert English language training by qualified teachers expert", to the user, for example to the user's device, for the user to browse, through dynamic web page technologies such as ASP, JSP or PHP, or through other agreed communication modes such as the http or https communication protocols.
It should be noted that the present invention can be implemented in software and/or in a combination of software and hardware, for example by using an application-specific integrated circuit (ASIC), a general-purpose computer or any other similar hardware device. In one embodiment, the software program of the present invention can be executed by a processor to realize the steps or functions described above. Likewise, the software program of the present invention (including related data structures) can be stored in a computer-readable recording medium, for example RAM, a magnetic or optical drive, a floppy disk or similar devices. In addition, some steps or functions of the present invention can be implemented in hardware, for example as a circuit that cooperates with a processor to perform each step or function.
In addition, part of the present invention can be applied as a computer program product, for example computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the present invention through the operation of that computer. The program instructions that invoke the method of the present invention may be stored in a fixed or removable recording medium, and/or transmitted via a data stream in a broadcast or other signal-bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, one embodiment of the present invention comprises an apparatus that includes a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the apparatus is triggered to run the methods and/or technical solutions based on the foregoing embodiments of the present invention.
To those skilled in the art, it is obvious that the present invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive. The scope of the present invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are therefore intended to be embraced in the present invention. Any reference sign in the claims should not be construed as limiting the claim concerned. In addition, the word "comprising" obviously does not exclude other units or steps, and the singular does not exclude the plural. A plurality of units or devices stated in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.

Claims (22)

1. A method for determining near-synonymy sequence clusters, wherein the method comprises the following steps:
a. obtaining a plurality of near-synonymy sequence pairs;
b. determining an initial near-synonymy sequence cluster corresponding to the plurality of near-synonymy sequence pairs;
c. clustering the sequences in the initial near-synonymy sequence cluster according to the feature vectors of the sequences in the initial near-synonymy sequence cluster, to obtain one or more near-synonymy sequence clusters.
2. The method according to claim 1, wherein step a comprises:
- obtaining a plurality of sequence-result pairs according to a plurality of search logs;
- screening out a plurality of near-synonymy sequence pairs from the sequences included in the plurality of sequence-result pairs, according to association information between the search results in the plurality of sequence-result pairs.
3. The method according to claim 1 or 2, wherein step b comprises:
- determining, based on a label propagation algorithm and according to the labels corresponding to the sequences in the plurality of near-synonymy sequence pairs, dense sequence clusters corresponding to the plurality of near-synonymy sequence pairs;
- merging the sequences in the plurality of near-synonymy sequence pairs according to the dense sequence clusters, to obtain the initial near-synonymy sequence cluster.
4. The method according to any one of claims 1 to 3, wherein step c comprises:
- determining similarity information between the sequences in the initial near-synonymy sequence cluster according to the feature vectors of the sequences in the initial near-synonymy sequence cluster;
- clustering the sequences in the initial near-synonymy sequence cluster according to the similarity information, to obtain one or more near-synonymy sequence clusters;
wherein the feature vector comprises at least the following feature components:
- sequence semantic feature information corresponding to the sequence;
- historical click information of the search results corresponding to the sequence;
- summary information of the search results corresponding to the sequence.
5. The method according to any one of claims 1 to 3, wherein step c comprises:
- clustering the sequences in the initial near-synonymy sequence cluster according to the feature vectors of the sequences in the initial near-synonymy sequence cluster, to obtain one or more candidate near-synonymy sequence clusters;
x. denoising the candidate near-synonymy sequence clusters, to obtain the near-synonymy sequence clusters.
6. The method according to claim 5, wherein step x comprises:
- denoising the candidate near-synonymy sequence cluster according to similarity information between the feature vectors of the sequences in the candidate near-synonymy sequence cluster and the cluster feature vector corresponding to that candidate near-synonymy sequence cluster, to obtain the near-synonymy sequence cluster.
7. The method according to any one of claims 1 to 6, wherein the method further comprises:
r. establishing or updating a near-synonymy sequence library according to the near-synonymy sequence clusters.
8. The method according to claim 7, wherein the method further comprises:
- detecting whether a sequence in the near-synonymy sequence cluster exists in other near-synonymy sequence clusters in the near-synonymy sequence library;
- if so, performing redundancy removal on that sequence, to update the near-synonymy sequence library.
9. The method according to claim 7 or 8, wherein the method further comprises:
- obtaining a query sequence entered by a user;
- performing a matching query in the near-synonymy sequence library according to the query sequence, to determine the target near-synonymy sequence cluster corresponding to the query sequence;
- providing at least one sequence in the target near-synonymy sequence cluster to the user as a recommendation for the query sequence.
10. The method according to claim 7 or 8, wherein step r comprises:
- establishing or updating the near-synonymy sequence library according to the near-synonymy sequence cluster and its corresponding group of preferred search results, wherein the near-synonymy sequence cluster corresponds to one group of preferred search results;
wherein the method further comprises:
- obtaining a query sequence entered by a user;
- performing a matching query in the near-synonymy sequence library according to the query sequence, to obtain the target near-synonymy sequence cluster corresponding to the query sequence;
- providing at least one of the group of preferred search results corresponding to the target near-synonymy sequence cluster to the user.
11. Determining equipment for determining near-synonymy sequence clusters, wherein the determining equipment comprises:
an obtaining device, configured to obtain a plurality of near-synonymy sequence pairs;
an initial determining device, configured to determine an initial near-synonymy sequence cluster corresponding to the plurality of near-synonymy sequence pairs;
a sequence cluster obtaining device, configured to cluster the sequences in the initial near-synonymy sequence cluster according to the feature vectors of the sequences in the initial near-synonymy sequence cluster, to obtain one or more near-synonymy sequence clusters.
12. The determining equipment according to claim 11, wherein the obtaining device is configured to:
- obtain a plurality of sequence-result pairs according to a plurality of search logs;
- screen out a plurality of near-synonymy sequence pairs from the sequences included in the plurality of sequence-result pairs, according to association information between the search results in the plurality of sequence-result pairs.
13. The determining equipment according to claim 11 or 12, wherein the initial determining device comprises:
a dense determining unit, configured to determine, based on a label propagation algorithm and according to the labels corresponding to the sequences in the plurality of near-synonymy sequence pairs, dense sequence clusters corresponding to the plurality of near-synonymy sequence pairs;
a sequence merging unit, configured to merge the sequences in the plurality of near-synonymy sequence pairs according to the dense sequence clusters, to obtain the initial near-synonymy sequence cluster.
14. The determining equipment according to any one of claims 11 to 13, wherein the sequence cluster obtaining device is configured to:
- determine similarity information between the sequences in the initial near-synonymy sequence cluster according to the feature vectors of the sequences in the initial near-synonymy sequence cluster;
- cluster the sequences in the initial near-synonymy sequence cluster according to the similarity information, to obtain one or more near-synonymy sequence clusters;
wherein the feature vector comprises at least the following feature components:
- sequence semantic feature information corresponding to the sequence;
- historical click information of the search results corresponding to the sequence;
- summary information of the search results corresponding to the sequence.
15. The determining equipment according to any one of claims 11 to 13, wherein the sequence cluster obtaining device comprises:
a candidate obtaining unit, configured to cluster the sequences in the initial near-synonymy sequence cluster according to the feature vectors of the sequences in the initial near-synonymy sequence cluster, to obtain one or more candidate near-synonymy sequence clusters;
a denoising unit, configured to denoise the candidate near-synonymy sequence clusters, to obtain the near-synonymy sequence clusters.
16. The determining equipment according to claim 15, wherein the denoising unit is configured to:
- denoise the candidate near-synonymy sequence cluster according to similarity information between the feature vectors of the sequences in the candidate near-synonymy sequence cluster and the cluster feature vector corresponding to that candidate near-synonymy sequence cluster, to obtain the near-synonymy sequence cluster.
17. The determining equipment according to any one of claims 11 to 16, wherein the determining equipment further comprises:
a sequence library establishing device, configured to establish or update a near-synonymy sequence library according to the near-synonymy sequence clusters.
18. The determining equipment according to claim 17, wherein the determining equipment further comprises:
a detecting device, configured to detect whether a sequence in the near-synonymy sequence cluster exists in other near-synonymy sequence clusters in the near-synonymy sequence library;
a redundancy removal device, configured to perform redundancy removal on that sequence if it exists, to update the near-synonymy sequence library.
19. The determining equipment according to claim 17 or 18, wherein the determining equipment further comprises:
a first sequence obtaining device, configured to obtain a query sequence entered by a user;
a first query device, configured to perform a matching query in the near-synonymy sequence library according to the query sequence, to determine the target near-synonymy sequence cluster corresponding to the query sequence;
a first providing device, configured to provide at least one sequence in the target near-synonymy sequence cluster to the user as a recommendation for the query sequence.
20. The determining equipment according to claim 17 or 18, wherein the sequence library establishing device is configured to:
- establish or update the near-synonymy sequence library according to the near-synonymy sequence cluster and its corresponding group of preferred search results, wherein the near-synonymy sequence cluster corresponds to one group of preferred search results;
wherein the determining equipment further comprises:
a second sequence obtaining device, configured to obtain a query sequence entered by a user;
a second query device, configured to perform a matching query in the near-synonymy sequence library according to the query sequence, to obtain the target near-synonymy sequence cluster corresponding to the query sequence;
a second providing device, configured to provide at least one of the group of preferred search results corresponding to the target near-synonymy sequence cluster to the user.
21. A search engine for determining near-synonymy sequence clusters, wherein the search engine comprises the determining equipment according to any one of claims 11 to 20.
22. A search engine plug-in for determining near-synonymy sequence clusters, wherein the search engine plug-in comprises the determining equipment according to any one of claims 11 to 20.
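The clustering of step c and the denoising of step x (claims 4 to 6) can be sketched in Python as follows. This is only an illustration under stated assumptions: cosine similarity as the similarity measure, a greedy single-pass clustering scheme, the mean vector as the cluster feature vector, and the `threshold` and `min_sim` parameters are all choices made for this example, not details given in the claims.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean, used here as the assumed cluster feature vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cluster(features: Dict[str, Vector], threshold: float) -> List[List[str]]:
    """Greedy pass: a sequence joins the first cluster whose centroid is
    at least `threshold` similar to it, otherwise it starts a new cluster."""
    clusters: List[List[str]] = []
    for seq, vec in features.items():
        for members in clusters:
            center = centroid([features[m] for m in members])
            if cosine(vec, center) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


def denoise(members: List[str], features: Dict[str, Vector],
            min_sim: float) -> List[str]:
    """Drop sequences whose similarity to the cluster centroid is below `min_sim`."""
    center = centroid([features[m] for m in members])
    return [m for m in members if cosine(features[m], center) >= min_sim]
```

In practice each feature vector would combine the semantic, click and summary components listed in claim 4; the sketch treats it as an opaque list of floats.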
CN201310105086.XA 2013-03-28 2013-03-28 Method and equipment for determining near-synonymy sequence clusters Active CN103246697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310105086.XA CN103246697B (en) 2013-03-28 2013-03-28 Method and equipment for determining near-synonymy sequence clusters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310105086.XA CN103246697B (en) 2013-03-28 2013-03-28 Method and equipment for determining near-synonymy sequence clusters

Publications (2)

Publication Number Publication Date
CN103246697A true CN103246697A (en) 2013-08-14
CN103246697B CN103246697B (en) 2016-12-28

Family

ID=48926217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310105086.XA Active CN103246697B (en) 2013-03-28 2013-03-28 Method and equipment for determining near-synonymy sequence clusters

Country Status (1)

Country Link
CN (1) CN103246697B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786851A (en) * 2014-12-23 2016-07-20 北京奇虎科技有限公司 Question and answer knowledge base construction method as well as search provision method and apparatus
CN111428476A (en) * 2019-01-09 2020-07-17 百度在线网络技术(北京)有限公司 Synonym generation method and device, electronic equipment and storage medium
CN112925912A (en) * 2021-02-26 2021-06-08 北京百度网讯科技有限公司 Text processing method, and synonymous text recall method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070250500A1 (en) * 2005-12-05 2007-10-25 Collarity, Inc. Multi-directional and auto-adaptive relevance and search system and methods thereof
CN101241502A (en) * 2008-03-13 2008-08-13 复旦大学 XML document keyword searching and clustering method based on semantic distance model
CN101308496A (en) * 2008-07-04 2008-11-19 沈阳格微软件有限责任公司 Large scale text data external clustering method and system
CN102043845A (en) * 2010-12-08 2011-05-04 百度在线网络技术(北京)有限公司 Method and equipment for extracting core keywords based on query sequence cluster

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786851A (en) * 2014-12-23 2016-07-20 北京奇虎科技有限公司 Question and answer knowledge base construction method as well as search provision method and apparatus
CN111428476A (en) * 2019-01-09 2020-07-17 百度在线网络技术(北京)有限公司 Synonym generation method and device, electronic equipment and storage medium
CN111428476B (en) * 2019-01-09 2023-03-31 百度在线网络技术(北京)有限公司 Synonym generation method and device, electronic equipment and storage medium
CN112925912A (en) * 2021-02-26 2021-06-08 北京百度网讯科技有限公司 Text processing method, and synonymous text recall method and device
CN112925912B (en) * 2021-02-26 2024-01-12 北京百度网讯科技有限公司 Text processing method, synonymous text recall method and apparatus

Also Published As

Publication number Publication date
CN103246697B (en) 2016-12-28

Similar Documents

Publication Publication Date Title
CN101364239B (en) Method for automatically constructing a classified catalogue and related system
CN103226578B (en) Method for website identification and fine-grained web page classification for the medical domain
Kang et al. Construction of a large-scale test set for author disambiguation
CN107480158A (en) Method and system for evaluating the matching of content items and images based on similarity scores
CN107145496A (en) Method for matching images with content items based on keywords
CN103823824A (en) Method and system for automatically constructing text classification corpus by aid of internet
CN103631794A (en) Method, device and equipment for sorting search results
CN103544188A (en) Method and device for pushing mobile internet content based on user preference
CN106383887A (en) Environment-friendly news data acquisition and recommendation display method and system
CN104484343A (en) Topic detection and tracking method for microblog
CN103294781A (en) Method and equipment used for processing page data
Baralis et al. Analysis of twitter data using a multiple-level clustering strategy
CN103678412A (en) Document retrieval method and device
Prajapati A survey paper on hyperlink-induced topic search (HITS) algorithms for web mining
CN103020123A (en) Method for searching bad video website
CN107145497A (en) Method for selecting images matching content based on metadata of images and content
CN103744887A (en) Method and device for people search and computer equipment
Zubiaga et al. Content-based clustering for tag cloud visualization
CN103116635A (en) Field-oriented method and system for collecting invisible web resources
CN103745380A (en) Advertisement delivery method and apparatus
CN108959641A (en) A kind of content information recommended method and system based on artificial intelligence
Chen et al. Finding keywords in blogs: Efficient keyword extraction in blog mining via user behaviors
CN103246697A (en) Method and equipment for determining near-synonymy sequence clusters
CN110321446A (en) Related data recommended method, device, computer equipment and storage medium
CN103235784A (en) Method and equipment used for obtaining search results

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant