CN101853272A - Search engine technology based on relevance feedback and clustering - Google Patents

Search engine technology based on relevance feedback and clustering

Info

Publication number
CN101853272A
Authority
CN
China
Prior art keywords
document
class
vector
uncorrelated
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201010165586
Other languages
Chinese (zh)
Other versions
CN101853272B (en)
Inventor
李新叶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
North China Electric Power University
Original Assignee
North China Electric Power University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by North China Electric Power University
Priority to CN2010101655869A
Publication of CN101853272A
Application granted
Publication of CN101853272B
Expired - Fee Related
Anticipated expiration

Abstract

The invention relates to a search engine technology based on relevance feedback and clustering. By using both the user's relevance feedback and the relevance ranking to guide the clustering of retrieval results, the invention ensures that the final partition of the retrieval results meets the user's query requirements. During clustering, the large number of documents irrelevant to the user and the duplicate web pages are removed, which speeds up clustering and optimizes the retrieval results at the same time. The cluster centers of clusters irrelevant to the user are not modified during clustering, which ensures that result documents relevant to the user are not lost to noise introduced into the irrelevant document clusters.

Description

Search engine technology based on relevance feedback and clustering
Technical field
The present invention relates to the field of Internet information retrieval, and in particular to a method for optimizing Web retrieval results based on relevance feedback and clustering.
Background art
At present, most search engines index and retrieve by keyword: given the list of keywords entered by the user, the search engine looks up its index database and displays the matching documents in order of their relevance to the user's query. Because keywords are often polysemous, and because users usually enter only a few keywords, the result list returned by a search engine typically mixes together many documents on unrelated topics. The user must browse the result list item by item to find the relevant documents, and the list also contains many web pages with duplicate content, so extracting information from such results costs the user a great deal of time and effort.
To make browsing easier, some researchers have applied automatic clustering to partition Web retrieval results into classes, placing documents with similar features (for example, documents on the same topic) in the same group, so that the user can narrow the search and browse only the few groups of interest. However, automatic clustering of retrieval results does not take relevance to the user into account, so the clustered results cannot reflect the user's specific intent or professional domain, and the user cannot choose how documents are clustered according to personal needs and interests. In addition, a Web search engine returns a very large number of results, and existing automatic clustering work clusters the entire result set, including the many results irrelevant to the user; the clustering therefore takes a long time and degrades the performance of the search engine.
To tie the clustering of retrieval results to the query needs of a specific user, a semi-supervised clustering method for retrieval results based on query logs has been proposed. This method derives must-link constraints from the click records in the query log: if a user clicked two results on the same result page, both are assumed to be relevant to the user's query, and a must-link constraint is inferred between them. Because must-link constraints that arise from individual preferences may be noisy, the method first counts how often each constraint occurs in the query log and then keeps only the constraints whose frequency exceeds a certain threshold as the final must-link constraints. Traversing the query log in this way yields the must-link constraints for each query, and the retrieval results are finally clustered under these constraints. This method has two drawbacks. First, the query log does not contain every query a user might issue, so no constraints can be obtained from it for a new query. Second, the clustering only guarantees that results under a must-link constraint fall in the same cluster and that results under a cannot-link constraint fall in different clusters; it does not optimize the clustering process itself. When applied to Web retrieval results, it still clusters all results, both relevant and irrelevant to the user, so it remains time-consuming and degrades the performance of the search engine.
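The frequency filter that this prior-art method applies to must-link constraints can be illustrated with a short Python sketch. It is only an illustration under assumptions: the log format (one list of clicked result identifiers per result page), the function name `must_link_constraints`, and the parameter `min_count` are hypothetical and do not come from the cited method.

```python
from collections import Counter
from itertools import combinations

def must_link_constraints(click_sessions, min_count):
    """Return the result pairs co-clicked more than min_count times.

    click_sessions: list of lists, each holding the result ids a user
    clicked on one result page (assumed log format).
    """
    pair_counts = Counter()
    for clicked in click_sessions:
        # Every pair of distinct results clicked together is a candidate constraint.
        for pair in combinations(sorted(set(clicked)), 2):
            pair_counts[pair] += 1
    # Keep only the constraints whose frequency exceeds the threshold.
    return {pair for pair, count in pair_counts.items() if count > min_count}

sessions = [["d1", "d2"], ["d1", "d2", "d3"], ["d2", "d3"]]
constraints = must_link_constraints(sessions, min_count=1)
print(sorted(constraints))  # [('d1', 'd2'), ('d2', 'd3')]
```

A query that never appears in the log produces no sessions and therefore no constraints, which is the first drawback noted above.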
Another method incorporates user feedback into text clustering. The user first designates some example documents that belong to particular clusters in order to guide the clustering process. The clustering result is then presented to the user, who inspects it and provides feedback, for example: document d should (or should not) belong to cluster S; document d should be moved from cluster S_i to cluster S_j; two documents should (or should not) be in the same cluster. The next round of clustering is guided by this feedback, and the interaction repeats until the user is satisfied with the clustering result. Each cluster is modeled with local feature weights that reflect the importance of a feature to that cluster; adding more, and more accurate, constraints improves the quality of the local feature weights and thus the clustering result. This method mainly addresses the effectiveness of text clustering, but it requires the user to enter feedback repeatedly, which increases the user's burden. In particular, the first round requires the user to designate example documents for the clusters, which is difficult for the user. The clustering is also time-consuming, so the method is not suitable for clustering Web retrieval results.
Summary of the invention
The methods described above require the user to enter complex feedback repeatedly, or rely on query logs that are of no use for new queries; they are also slow because they cluster the entire result set, and the resulting partition still contains irrelevant document classes or documents with a large amount of duplicate content. To address these drawbacks, the present invention provides a method that optimizes Web retrieval results under the guidance of a small amount of feedback entered by the user, namely a few documents relevant to the query and a few irrelevant to it.
The present invention adopts the following technical solution:
(1) Determine the initial number of clusters and the initial cluster center vector of each class, comprising:
The relevant documents that the user selects from the retrieval results are placed in one class, called the relevant document class, and the initial cluster center of this class is determined. The initial cluster center vector of the relevant document class is obtained by computing the weighted mean of each keyword over the documents in the class.
The irrelevant documents are divided into one or more irrelevant document classes, and the initial cluster center of each class is determined, comprising:
- Select one irrelevant document as the first irrelevant document class; the feature vector of this document is the cluster center vector of the class.
- Compute the similarity between each remaining irrelevant document and the existing irrelevant classes and, according to the similarity value, either assign the document to the most similar irrelevant class or place it in a new irrelevant class; in the latter case, the feature vector of the document is the cluster center vector of the new class.
(2) Form the initial partition and determine the final number of clusters:
Compute the similarity between each document in the result list that the user did not select and the relevant and irrelevant document classes, and handle the document according to the similarity value in one of the following ways:
- assign it to the most similar document class;
- or place it in a new document class, with the feature vector of the document as the cluster center vector of that class;
- or judge it to be a document with duplicate content and delete it.
(3) Remove documents with duplicate content from each document class (cluster) of the initial partition:
Starting from a document d1 in the class, compute the similarity between its feature vector and the vector of each document after it, and judge from the similarity value whether a document duplicates the content of d1; if so, delete that document from both the result list and the document class.
Then, starting from the next document in the updated result list, compute the similarity between its feature vector and the feature vector of each document after it, and again judge which documents are duplicates.
Repeat this process until the end of the result list is reached.
(4) Update the cluster center vectors of all classes other than the irrelevant document classes:
The cluster center vector of a class is obtained by computing the weighted mean of each keyword over the documents in the class.
(5) Recompute the similarity between each of the other documents in the result list that the user did not select and each cluster center, and repartition, comprising:
- Compute the similarity between the feature vector of each document and the cluster center vector of each class, and assign the document to the most similar class.
- If a document belongs to an irrelevant document class and ranks low in relevance to the query, delete it from both the irrelevant document class and the result list.
(6) Repeat steps (4) and (5) until the termination condition is satisfied.
The present invention uses both the user's relevance feedback and the relevance ranking to guide the clustering of retrieval results, so that the final partition of the retrieval results meets the user's query needs. During clustering, the large number of documents irrelevant to the user and the duplicate pages are removed, which speeds up clustering and optimizes the retrieval results at the same time. The cluster centers of the clusters irrelevant to the user are not modified during clustering, which ensures that result documents relevant to the user are not lost to noise introduced into the irrelevant document clusters.
Description of drawings
The present invention is described in detail below with reference to the accompanying drawing:
Fig. 1 is a flow chart of the present invention.
Embodiment
Step S101: The user selects relevant documents and irrelevant documents from the search engine's retrieval results.
Step S102: Determine the initial number of clusters and the initial cluster centers.
Suppose the documents in the result list are d1, d2, ..., ds (s is the number of documents), and that the keywords indexed in the retrieval system's index database do not include stop words. From the index database, select the keywords t1, t2, t3, ..., tn (n is the number of keywords) whose weight in documents d1, d2, ..., ds, that is, the frequency with which the keyword occurs in a document, is greater than the preset threshold δk. These keywords form the dimensions of the vectors in the vector space model, and the feature vector $d_i$ of document di is defined as:
$$d_i = (w_{i1}, w_{i2}, \ldots, w_{in}) \qquad (1)$$
where $w_{ij} = tf_{ij}$ ($i = 1, 2, \ldots, s$; $j = 1, 2, \ldots, n$), and $tf_{ij}$ is the frequency with which the j-th keyword occurs in the i-th document di.
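The construction of the feature vectors in formula (1) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes documents are already tokenized with stop words removed, and it reads the keyword selection rule as "keep a keyword if its frequency exceeds the threshold in at least one document", which is one possible interpretation of the text. The names `build_vocabulary`, `feature_vector`, and `threshold` are hypothetical.

```python
from collections import Counter

def build_vocabulary(docs, threshold):
    """Select keywords whose term frequency exceeds threshold in some document."""
    vocab = set()
    for tokens in docs:
        for term, count in Counter(tokens).items():
            if count > threshold:
                vocab.add(term)
    return sorted(vocab)  # fixed order = dimensions of the vector space

def feature_vector(tokens, vocab):
    """Formula (1): w_ij is the frequency of keyword j in document i."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

docs = [
    ["cluster", "search", "cluster", "engine"],
    ["search", "search", "query"],
]
vocab = build_vocabulary(docs, threshold=1)
vectors = [feature_vector(tokens, vocab) for tokens in docs]
print(vocab)    # ['cluster', 'search']
print(vectors)  # [[2, 1], [0, 2]]
```

Raw term frequency is used as the weight because that is what formula (1) specifies; no normalization or inverse document frequency is applied.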
1. Extract the common feature vector of the relevant documents:
The relevant documents selected by the user form one relevant document class, denoted C1. Suppose the relevant documents in class C1 are d1, d2, ..., dm (m is the number of relevant documents selected by the user). The weights of keywords t1, t2, t3, ..., tn in class C1 are then:
$$a_{1j} = \frac{\sum_{i=1}^{m} tf_{ij}}{m}, \qquad j = 1, 2, \ldots, n \qquad (2)$$
The initial cluster center vector of class C1 is defined as:
$$C1_{center} = (a_{11}, a_{12}, \ldots, a_{1n})$$
At this point the number of clusters is k = 1.
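Formula (2) is the per-keyword arithmetic mean of the term frequencies over the m documents of the class. A small Python sketch, with the hypothetical helper name `class_center` and made-up example vectors:

```python
def class_center(vectors):
    """Formula (2): a_j = (sum of tf_ij over the m documents) / m."""
    m = len(vectors)
    n = len(vectors[0])
    return [sum(vec[j] for vec in vectors) / m for j in range(n)]

# Two relevant documents described by three keywords (illustrative values).
relevant = [[2, 0, 1], [4, 2, 1]]
c1_center = class_center(relevant)
print(c1_center)  # [3.0, 1.0, 1.0]
```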
2. Partition the irrelevant documents into classes:
The irrelevant documents selected by the user may belong to the same document class or to different document classes. The t irrelevant documents are partitioned as follows:
- Pick any one irrelevant document, denoted di (i = 1, 2, ..., t), and set the number of clusters to k = k + 1. Assign document di to document class Ck, and use the feature vector $d_i$ of di as the cluster center vector $Ck_{center}$ of class Ck.
- Repeat the following process for each of the remaining t - 1 irrelevant documents:
Compute the similarity between the feature vector $d_i$ of document di and the cluster center vector of each irrelevant document class, using the cosine of the angle between the vectors:
$$sim(d_i, Cj_{center}) = \frac{\sum_{v=1}^{n} w_{iv} \times a_{jv}}{\sqrt{\left(\sum_{v=1}^{n} w_{iv}^{2}\right) \times \left(\sum_{v=1}^{n} a_{jv}^{2}\right)}} \qquad (3)$$
where $Cj_{center}$ is the cluster center vector of the j-th document class; $w_{iv}$ is the weight of the v-th keyword in the i-th document di, as defined in formula (1); and $a_{jv}$ is the weight of the v-th keyword in the j-th cluster Cj.
If the similarity between di and the cluster center vector $Cg_{center}$ of an existing irrelevant document class Cg (g = 2, 3, ..., k) is the largest and exceeds the set threshold δ1, assign document di to class Cg.
If the similarity between di and the cluster center vectors of all current irrelevant classes is less than the set threshold δ1, set k = k + 1, place document di in a new document class Ck, and use the feature vector $d_i$ of di as the cluster center vector $Ck_{center}$ of class Ck.
When this process finishes, the irrelevant documents have been divided into d = k - 1 classes, and the initial number of clusters is k.
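The cosine similarity of formula (3) and the partition of the user-marked irrelevant documents can be sketched in Python as below. This is a sketch under assumptions: the function names, the threshold value, and the example vectors are hypothetical; a similarity exactly equal to δ1, which the text does not cover, is treated here as "start a new class"; and a class center stays equal to the vector of its first document, since the text does not describe updating it at this stage.

```python
import math

def cosine(d, center):
    """Formula (3): cosine of the angle between a document and a center."""
    dot = sum(w * a for w, a in zip(d, center))
    norm = math.sqrt(sum(w * w for w in d) * sum(a * a for a in center))
    return dot / norm if norm else 0.0

def partition_irrelevant(vectors, delta1):
    """Group irrelevant documents; returns (centers, members by index)."""
    centers = [list(vectors[0])]  # the first document seeds the first class
    members = [[0]]
    for idx in range(1, len(vectors)):
        sims = [cosine(vectors[idx], c) for c in centers]
        best = max(range(len(centers)), key=sims.__getitem__)
        if sims[best] > delta1:
            members[best].append(idx)           # join the most similar class
        else:
            centers.append(list(vectors[idx]))  # the document seeds a new class
            members.append([idx])
    return centers, members

irrelevant = [[1, 0], [2, 0], [0, 3]]
centers, members = partition_irrelevant(irrelevant, delta1=0.5)
print(members)  # [[0, 1], [2]]
print(centers)  # [[1, 0], [0, 3]]
```

The result depends on the order in which the documents are visited, because the first document of each class fixes its center.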
Step S103: Form the initial partition and determine the final number of clusters k.
For each document di in the result list that the user did not select, repeat the following process:
1. Compute the similarity between the feature vector $d_i$ of document di and the cluster center vector of each document class, using formula (3).
2. If the similarity between the feature vector $d_i$ and the cluster center vector $Cr_{center}$ of the r-th document class Cr is the largest and exceeds the set threshold δ1:
- if the similarity is less than the set threshold δ2 (δ2 > δ1), assign document di to class Cr;
- otherwise, treat document di as a duplicate page and delete it from the result list.
3. If the similarity between the feature vector $d_i$ and the cluster center vectors of all k current document classes is less than the set threshold δ1, set k = k + 1, place document di in a new document class Ck, and use the feature vector $d_i$ of di as the cluster center vector $Ck_{center}$ of Ck.
When this process finishes, the initial partition has been formed and the final number of clusters is k.
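The three-way decision of step S103 (assign, discard as duplicate, or open a new class) can be sketched in Python as follows. The function name `initial_partition`, the threshold values, and the example vectors are assumptions made for illustration; documents are identified by their position in the list of unselected results.

```python
import math

def cosine(d, center):
    """Cosine similarity as in formula (3)."""
    dot = sum(w * a for w, a in zip(d, center))
    norm = math.sqrt(sum(w * w for w in d) * sum(a * a for a in center))
    return dot / norm if norm else 0.0

def initial_partition(unselected, centers, delta1, delta2):
    """Return (centers, members, duplicates) after one pass over the documents."""
    centers = [list(c) for c in centers]
    members = [[] for _ in centers]
    duplicates = []
    for idx, vec in enumerate(unselected):
        sims = [cosine(vec, c) for c in centers]
        best = max(range(len(centers)), key=sims.__getitem__)
        if sims[best] > delta1:
            if sims[best] < delta2:
                members[best].append(idx)  # similar enough: join class `best`
            else:
                duplicates.append(idx)     # too similar: treated as a duplicate page
        else:
            centers.append(list(vec))      # dissimilar to every class: new class
            members.append([idx])
    return centers, members, duplicates

start_centers = [[1, 0]]
unselected = [[2, 0], [1, 1], [0, 1]]
centers, members, duplicates = initial_partition(
    unselected, start_centers, delta1=0.5, delta2=0.95
)
print(members)     # [[1], [2]]
print(duplicates)  # [0]
```

In this example the first document points in exactly the same direction as the existing center (similarity 1.0) and is dropped, the second joins the existing class, and the third opens a new class, so k grows from 1 to 2.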
Step S104: Remove the pages with duplicate content from the k document classes.
Suppose a document class consists of p documents, forming the document list d1, d2, ..., dp.
Starting from document d1, compute the similarity between its vector and the vectors of the p - 1 documents after it. If the similarity with the feature vector of some document dx is greater than the set threshold δ2, the two are considered duplicate pages: delete document dx from both the result list and the document class, and set p = p - 1.
Then, starting from d2 in the updated result list, compute the similarity between its vector and the vectors of the p - 2 documents after it. If the similarity with the feature vector of some document dy is greater than the set threshold δ2, the two are considered duplicate pages: delete document dy from both the result list and the cluster, and set p = p - 1.
Repeat this process until the end of the result list is reached.
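The pairwise duplicate removal of step S104 within one document class can be sketched in Python as follows. The function name `remove_duplicates`, the document identifiers, and the threshold value are hypothetical; the sketch returns the documents that survive, in their original order.

```python
import math

def cosine(a, b):
    """Cosine similarity between two document vectors, as in formula (3)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0

def remove_duplicates(doc_ids, vectors, delta2):
    """Keep each document and drop later ones whose similarity to it exceeds delta2."""
    kept = []
    remaining = list(doc_ids)
    while remaining:
        current = remaining.pop(0)
        kept.append(current)
        # Compare the current document with every document after it.
        remaining = [
            other for other in remaining
            if cosine(vectors[current], vectors[other]) <= delta2
        ]
    return kept

vectors = {"a": [1, 0], "b": [2, 0], "c": [0, 1], "d": [0, 5]}
kept = remove_duplicates(["a", "b", "c", "d"], vectors, delta2=0.95)
print(kept)  # ['a', 'c']
```

Because the earlier document of a duplicate pair is always the one kept, the higher-ranked copy survives when the class list follows the order of the result list.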
Step S105: Update the cluster center vectors of the document classes other than the d irrelevant classes.
Step S106: Recompute the similarity between each of the other documents in the result list that the user did not select and the k classes, and repartition.
For each document di in the updated result list that the user did not select, repeat the following process:
1. Compute the similarity between the feature vector $d_i$ of document di and the cluster center vector of each document class using formula (3), and assign document di to the most similar document class.
2. If document di belongs to an irrelevant document class and ranks low in relevance to the query, delete it from both the irrelevant document class and the result list.
Step S107: Repeat steps S105 and S106 until the termination condition is satisfied.
The termination condition is that the objective function value reaches its minimum or that the set number of iterations is reached.
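The iteration over steps S105 and S106 can be sketched in Python as follows. Several details are assumptions, because the text leaves them open: "ranks low" is modelled by a hypothetical cutoff `rank_cutoff` on the result rank (rank 1 is the best), the objective function is not specified so the loop stops when the assignment no longer changes or after `max_iter` passes, class 0 is taken to be the relevant class, and each pass reassigns documents first and then updates the centers for the next pass.

```python
import math

def cosine(d, center):
    """Cosine similarity as in formula (3)."""
    dot = sum(w * a for w, a in zip(d, center))
    norm = math.sqrt(sum(w * w for w in d) * sum(a * a for a in center))
    return dot / norm if norm else 0.0

def mean(vectors):
    """Per-keyword mean, as in formula (2)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def refine(unselected, marked_relevant, centers, irrelevant, rank_cutoff, max_iter=10):
    """unselected: {doc_id: (rank, vector)}; irrelevant: set of class indices."""
    unselected = dict(unselected)
    centers = [list(c) for c in centers]
    assignment = {}
    for _ in range(max_iter):
        new_assignment = {}
        for doc_id, (rank, vec) in list(unselected.items()):
            sims = [cosine(vec, c) for c in centers]
            best = max(range(len(centers)), key=sims.__getitem__)
            if best in irrelevant and rank > rank_cutoff:
                del unselected[doc_id]         # low-ranked and irrelevant: drop it
            else:
                new_assignment[doc_id] = best  # S106: most similar class
        if new_assignment == assignment:
            break                              # stable partition: stop
        assignment = new_assignment
        for k in range(len(centers)):
            if k in irrelevant:
                continue                       # S105: irrelevant centers stay fixed
            vecs = [unselected[d][1] for d, c in assignment.items() if c == k]
            if k == 0:
                vecs = vecs + marked_relevant  # user-selected documents stay in class 0
            if vecs:
                centers[k] = mean(vecs)
    return assignment, centers

unselected = {"a": (1, [3.0, 1.0]), "b": (2, [0.0, 2.0]), "c": (9, [1.0, 4.0])}
assignment, centers = refine(
    unselected,
    marked_relevant=[[1.0, 0.0]],
    centers=[[1.0, 0.0], [0.0, 1.0]],
    irrelevant={1},
    rank_cutoff=5,
)
print(assignment)  # {'a': 0, 'b': 1}
print(centers)     # [[2.0, 0.5], [0.0, 1.0]]
```

Document "c" falls in the irrelevant class and ranks below the cutoff, so it is removed, while the center of the irrelevant class is never modified, in line with the stated design.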

Claims (6)

1. A search engine technology based on relevance feedback and clustering, characterized in that it comprises the following steps:
Step 1: determining the initial number of clusters and the initial cluster center vector of each class, comprising:
placing the relevant documents that the user selects from the retrieval results in a relevant document class, and determining the initial cluster center vector of this relevant document class, the initial cluster center vector being obtained by computing the weighted mean of each keyword over the documents in the relevant document class;
dividing the irrelevant documents into one or more irrelevant document classes, and determining the initial cluster center vector of each irrelevant document class, comprising:
- selecting one irrelevant document as the first irrelevant document class, and taking the feature vector of this irrelevant document as the cluster center vector of that irrelevant document class;
- computing the similarity between each remaining irrelevant document and the existing irrelevant document classes and, according to the resulting similarity value, either assigning the irrelevant document to the current irrelevant document class most similar to it, or placing it in a new irrelevant document class and taking its feature vector as the cluster center vector of the new irrelevant document class;
Step 2: forming the initial partition and determining the final number of clusters:
computing the similarity between each document in the result list that the user did not select and the relevant document class and the irrelevant document classes, and handling the document according to the similarity value in one of the following ways:
- assigning the document to the current document class most similar to it;
- or placing the document in a new document class, and taking the feature vector of the document as the cluster center vector of the new document class;
- or judging the document to be a document with duplicate content and deleting it;
Step 3: removing the documents with duplicate content from each document class of the initial partition:
starting from the first document in the document class, computing the similarity between the feature vector of that document and the feature vector of each document after it, judging from the similarity value whether another document duplicates its content and, if so, deleting the duplicating document from both the result list and the document class;
then, starting from the next document in the updated result list, computing the similarity between the feature vector of that document and the feature vector of each document after it, and judging and deleting the documents with duplicate content accordingly;
repeating this process until the end of the result list is reached;
Step 4: updating the cluster center vectors of the document classes other than the irrelevant document classes,
the cluster center vector being obtained by computing the weighted mean of each keyword over the documents in the document class;
Step 5: recomputing the similarity between the feature vector of each of the other documents in the result list that the user did not select and the cluster center vector of each current document class, and repartitioning accordingly, comprising:
- assigning the document to the document class most similar to it;
- if a document belongs to an irrelevant document class and ranks low in relevance to the query, deleting the document from both the irrelevant document class and the result list;
Step 6: repeating steps 4 and 5 until the termination condition is satisfied.
2. The method of claim 1, characterized in that the feature vector of a document is defined as:
$$d_i = (w_{i1}, w_{i2}, \ldots, w_{in})$$
where $d_i$ is the feature vector of document di, $w_{ij} = tf_{ij}$ ($j = 1, 2, \ldots, n$; n is the number of keywords), and $tf_{ij}$ is the frequency with which the j-th keyword occurs in document di.
3. The method of claim 1, characterized in that, in step 1 and step 4, the weighted mean of keyword j over the documents of document class r is computed as:
$$a_{rj} = \frac{\sum_{i=1}^{m} tf_{ij}}{m}$$
where m is the number of documents in the document class.
4. The method of claim 1, characterized in that the cluster center vector of a document class r is expressed as:
$$Cr_{center} = (a_{r1}, a_{r2}, \ldots, a_{rn})$$
5. The method of claim 1, characterized in that, in steps 1 to 5, the similarity between the feature vector of a document and the cluster center vector of a document class is computed as the cosine of the angle between the vectors:
$$sim(d_i, Cj_{center}) = \frac{\sum_{v=1}^{n} w_{iv} \times a_{jv}}{\sqrt{\left(\sum_{v=1}^{n} w_{iv}^{2}\right) \times \left(\sum_{v=1}^{n} a_{jv}^{2}\right)}}$$
6. The method of claim 1 or 5, characterized in that, in step 2, thresholds are set and compared with the similarity to decide how a document is handled; specifically, if the similarity value exceeds the set threshold δ1:
- if the similarity value is less than the set threshold δ2 (δ2 > δ1), the document is assigned to the document class with the highest similarity to it;
- otherwise, the document is judged to be a document with duplicate content.
CN2010101655869A 2010-04-30 2010-04-30 Search engine technology based on relevance feedback and clustering Expired - Fee Related CN101853272B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010101655869A CN101853272B (en) 2010-04-30 2010-04-30 Search engine technology based on relevance feedback and clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010101655869A CN101853272B (en) 2010-04-30 2010-04-30 Search engine technology based on relevance feedback and clustering

Publications (2)

Publication Number Publication Date
CN101853272A true CN101853272A (en) 2010-10-06
CN101853272B CN101853272B (en) 2012-07-04

Family

ID=42804764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010101655869A Expired - Fee Related CN101853272B (en) 2010-04-30 2010-04-30 Search engine technology based on relevance feedback and clustering

Country Status (1)

Country Link
CN (1) CN101853272B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070185871A1 (en) * 2006-02-08 2007-08-09 Telenor Asa Document similarity scoring and ranking method, device and computer program product
CN101271476A (en) * 2008-04-25 2008-09-24 清华大学 Relevant feedback retrieval method based on clustering in network image search
US20090019026A1 (en) * 2007-07-09 2009-01-15 Vivisimo, Inc. Clustering System and Method
CN101436201A (en) * 2008-11-26 2009-05-20 哈尔滨工业大学 Characteristic quantification method of graininess-variable text cluster
CN101458708A (en) * 2008-12-05 2009-06-17 北京大学 Searching result clustering method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
《中国博士学位论文全文数据库》 20091130 李新叶 基于XML文档结构语义的信息检索方法与应用研究 , 2 *
《计算机工程》 20061020 李新叶,等 一种用于Web搜索的高效聚类算法 第32卷, 第20期 2 *

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101986297A (en) * 2010-10-28 2011-03-16 浙江大学 Accessibility web browsing method based on linkage cluster
CN102073718B (en) * 2011-01-10 2013-01-30 清华大学 System and method for explaining, erasing and modifying search result in probabilistic database
CN102073718A (en) * 2011-01-10 2011-05-25 清华大学 System and method for explaining, erasing and modifying search result in probabilistic database
CN102654879B (en) * 2011-03-04 2015-01-28 中兴通讯股份有限公司 Search method and device
CN102654879A (en) * 2011-03-04 2012-09-05 中兴通讯股份有限公司 Search method and device
CN102737045A (en) * 2011-04-08 2012-10-17 北京百度网讯科技有限公司 Method and device for relevancy computation
CN102737045B (en) * 2011-04-08 2014-02-19 北京百度网讯科技有限公司 Method and device for relevancy computation
CN102867006A (en) * 2011-07-07 2013-01-09 富士通株式会社 Method and system for batching and clustering
CN102867006B (en) * 2011-07-07 2016-04-13 富士通株式会社 One is clustering method and system in batches
CN102419779A (en) * 2012-01-13 2012-04-18 青岛理工大学 Method and device for personalized searching of commodities sequenced based on attributes
CN102419779B (en) * 2012-01-13 2014-06-11 青岛理工大学 Method and device for personalized searching of commodities sequenced based on attributes
CN102629272A (en) * 2012-03-14 2012-08-08 北京邮电大学 Clustering based optimization method for examination system database
CN102693304B (en) * 2012-05-22 2014-10-22 北京邮电大学 Search engine feedback information processing method and search engine
CN102693304A (en) * 2012-05-22 2012-09-26 北京邮电大学 Search engine feedback information processing method and search engine
CN102890698B (en) * 2012-06-20 2015-06-24 杜小勇 Method for automatically describing microblogging topic tag
CN102890698A (en) * 2012-06-20 2013-01-23 杜小勇 Method for automatically describing microblogging topic tag
CN102968465A (en) * 2012-11-09 2013-03-13 同济大学 Network information service platform and search service method based on network information service platform
CN102968465B (en) * 2012-11-09 2015-07-29 同济大学 Network information service platform and search service method based on the platform
CN103034709A (en) * 2012-12-07 2013-04-10 北京海量融通软件技术有限公司 System and method for resequencing search results
CN103034709B (en) * 2012-12-07 2017-05-31 北京海量融通软件技术有限公司 Retrieving result reordering system and method
CN103870474A (en) * 2012-12-11 2014-06-18 北京百度网讯科技有限公司 News topic organizing method and device
CN103870474B (en) * 2012-12-11 2018-06-08 北京百度网讯科技有限公司 News topic organizing method and device
CN105634841B (en) * 2014-10-29 2018-12-11 任子行网络技术股份有限公司 Method and apparatus for reducing redundant logs of a network audit system
CN105634841A (en) * 2014-10-29 2016-06-01 任子行网络技术股份有限公司 Method and device for decreasing redundant logs of network auditing system
CN104699817A (en) * 2015-03-24 2015-06-10 中国人民解放军国防科学技术大学 Search engine ordering method and search engine ordering system based on improved spectral clusters
CN104699817B (en) * 2015-03-24 2018-01-05 中国人民解放军国防科学技术大学 Search engine ranking method and system based on improved spectral clustering
CN106294394B (en) * 2015-05-20 2019-10-15 北大方正集团有限公司 Data clustering method and data clustering system
CN106294394A (en) * 2015-05-20 2017-01-04 北大方正集团有限公司 Data clustering method and data clustering system
CN105068996A (en) * 2015-09-21 2015-11-18 哈尔滨工业大学 Incremental learning method for Chinese word segmentation
CN105068996B (en) * 2015-09-21 2017-11-17 哈尔滨工业大学 Incremental learning algorithm for Chinese word segmentation
CN105160014A (en) * 2015-09-24 2015-12-16 四川师范大学 Data processing method and apparatus
CN105868261A (en) * 2015-12-31 2016-08-17 乐视网信息技术(北京)股份有限公司 Method and device for obtaining and ranking associated information
CN110245275A (en) * 2019-06-18 2019-09-17 中电科大数据研究院有限公司 Rapid normalization method for large-scale similar news headlines
CN110245275B (en) * 2019-06-18 2023-09-01 中电科大数据研究院有限公司 Large-scale similar news headline rapid normalization method
CN111178455A (en) * 2020-01-07 2020-05-19 重庆中科云从科技有限公司 Image clustering method, system, device and medium
CN111966894A (en) * 2020-08-05 2020-11-20 深圳市欢太科技有限公司 Information query method and device, storage medium and electronic equipment
WO2022100071A1 (en) * 2020-11-10 2022-05-19 北京捷通华声科技股份有限公司 Voice text clustering method and apparatus
CN115408491A (en) * 2022-11-02 2022-11-29 京华信息科技股份有限公司 Text retrieval method and system for historical data
CN115408491B (en) * 2022-11-02 2023-01-17 京华信息科技股份有限公司 Text retrieval method and system for historical data

Also Published As

Publication number Publication date
CN101853272B (en) 2012-07-04

Similar Documents

Publication Publication Date Title
CN101853272B (en) Search engine technology based on relevance feedback and clustering
CN101174273B (en) News event detecting method based on metadata analysis
CN100465954C (en) Reinforced clustering of multi-type data objects for search term suggestion
US9104733B2 (en) Web search ranking
CN102760138B (en) Classification method and device for user network behaviors and search method and device for user network behaviors
CN101133388B (en) Multiple index based information retrieval system
CN101876981B (en) Method and device for building a knowledge base
CN102253982B (en) Query suggestion method based on query semantics and click-through data
CN105045875B (en) Personalized search method and device
US9928296B2 (en) Search lexicon expansion
CN104008109A (en) User interest based Web information push service system
JP6355840B2 (en) Stopword identification method and apparatus
CN103577416A (en) Query expansion method and system
CN1996316A (en) Search engine searching method based on web page correlation
Baliński et al. Re-ranking method based on inter-document distances
CN111522905A (en) Document searching method and device based on database
CN110807326B (en) Short text keyword extraction method combining GPU-DMM and text features
CN103020212A (en) Method and device for finding hot videos based on user query logs in real time
Maniu et al. Network-aware search in social tagging applications: Instance optimality versus efficiency
CN103020289A (en) Method for providing individual needs of search engine user based on log mining
CN115905489A (en) Method for providing bid and bid information search service
US8949254B1 (en) Enhancing the content and structure of a corpus of content
CN112800023B (en) Multi-model data distributed storage and hierarchical query method based on semantic classification
Klink Query reformulation with collaborative concept-based expansion
Jadidoleslamy Introduction to metasearch engines and result merging strategies: a survey

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120704

Termination date: 20170430