CN103218443A - Blogging webpage retrieval system and retrieval method - Google Patents

Publication number
CN103218443A
Authority
CN
China
Prior art keywords
webpage
page
web
blog
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013101417845A
Other languages
Chinese (zh)
Inventor
罗笑南
曾金龙
林格
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN2013101417845A
Publication of CN103218443A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a retrieval system and a retrieval method for blog web pages. The system comprises an information extraction module, a data preparation module, an index module and a retrieval module. The information extraction module crawls web pages relevant to a blog theme; the data preparation module performs structured information extraction and de-duplication on the initial web pages crawled by the information extraction module; the index module builds an index over the data extracted by the data preparation module; and the retrieval module provides a search interface for the user, retrieves according to the index, and sorts the retrieval results. Data storage is mapped onto a plurality of servers by hash mapping with a modulo operation, which keeps the load on the storage servers well balanced. The blog theme relevance measure returns web pages relevant to the blog theme the user searches for, which effectively improves search accuracy and eliminates web pages irrelevant to the theme.

Description

A web page retrieval system and method for blog web pages
Technical field
The present invention relates to the technical field of web search, and in particular to a web page retrieval system and method for blog web pages.
Background
In the past few years, Internet search engines have achieved great success. Google, which built its business on search, has been richly rewarded: its advertising revenue exceeds 100 million US dollars per day. China's domestic search market has also become lively because of the battle between 360 and Baidu, and more and more companies are entering the search engine war, because a search engine, like a browser, is an entry point to the Internet. However, each company keeps its core technology strictly confidential, so outsiders cannot learn how it is implemented; and current search engines each have their own strengths and weaknesses in different environments.
At present, traditional Internet search engines do not serve mobile environments well, and in specialized fields general-purpose search engines such as Baidu and Google are not the best choice; there is still considerable room to improve search accuracy. In particular, no search engine has yet been developed specifically for blog systems, and the retrieval and re-ranking of web pages relevant to a blog theme remain unsatisfactory.
Current search engines cannot perform theme-related retrieval and re-ranking based on the characteristics of blog systems. Instead, they treat blog pages like ordinary web pages and crawl them by following URL links to a controlled depth, so some pages irrelevant to the theme are also returned to the user. Moreover, they measure page relevance only by term frequency or by the matching degree of single words, which does not truly reflect the blog theme.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art. The invention provides a web page retrieval system and method for blog web pages that returns web pages relevant to the blog theme the user searches for, effectively improves search accuracy, and removes web pages irrelevant to the theme.
To solve the above problems, the invention proposes a web page retrieval system for blog web pages, the system comprising:
an information extraction module, used to crawl web pages relevant to a blog theme;
a data preparation module, used to perform structured information extraction and web page de-duplication on the initial web pages crawled by the information extraction module;
an index module, used to build an index over the data extracted by the data preparation module;
a retrieval module, used to provide a user search interface, retrieve according to the index, and sort the retrieval results.
Preferably, the system further comprises a web page database, used to store the downloaded web pages and the processed data.
Preferably, the system further comprises a system interface, wherein the system interface comprises an Internet web page interface and a user search entry.
Preferably, the data stored in the web page database comprise web page data and dictionary index data; wherein the web page data comprise: web page number, uniform resource locator (URL), title, summary, and web page size; and the dictionary index data comprise: the words in the Chinese vocabulary, English words, and the queue of web page numbers corresponding to each word.
Correspondingly, an embodiment of the invention also discloses a web page retrieval method for blog web pages, the method comprising:
crawling web pages relevant to a blog theme;
performing structured information extraction and web page de-duplication on the crawled initial web pages;
building an index over the extracted data;
retrieving according to the index, and sorting the retrieval results.
Preferably, the step of building an index over the extracted data comprises:
calling the addDocument method of IndexWriter;
creating a Document object;
adding and naming each field Segment in the created Document object;
calling the addDocument method of DocumentWriter to add the document to the index;
saving the Segment information.
Preferably, after the step of crawling web pages relevant to a blog theme, the method further comprises: filtering the crawled web pages.
Preferably, the step of filtering the crawled web pages comprises: evaluating the relevance of the crawled web pages to the blog theme; and deleting the web pages with lower relevance.
Preferably, the step of evaluating the relevance of the crawled web pages to the blog theme comprises:
extracting and weighting keywords from a subset of pages that describe the theme, to obtain the feature vector of the theme and the corresponding vector weights;
segmenting the body text of the page into words and removing stop words, leaving the keywords that are needed;
segmenting the page title into words, merging the resulting keywords with the keywords from the page body, and applying additional weight to the title keywords;
adjusting and expanding the keywords of the page according to the feature vector of the theme;
calculating the similarity sim(D, Di) between the page and the theme, where D is the theme and Di is the page to be compared;
comparing the value of sim(D, Di) with a threshold d: if sim(D, Di) is greater than or equal to d, the page is relevant to the theme and is kept in the theme page library; otherwise the page is deleted.
Preferably, the step of retrieving according to the index and sorting the retrieval results comprises:
segmenting the search terms into words;
finding the web page ID queue of each word by hashing;
applying the logical operations AND, OR and NOT to the web page ID queues found, to obtain the final queue of result web page IDs;
sorting the results by weight in descending order;
among the keywords similar to the segmented keywords, taking those with higher weight and displaying the similar keywords of the first segmentation result;
performing pagination, computing the final queue of web page IDs to be displayed, finding the stored content of the corresponding web page structures by these web page IDs, and displaying the search results to the user.
In the embodiments of the invention, data storage is mapped onto a plurality of servers by hash mapping with a modulo operation, which keeps the load on the storage servers well balanced; and the blog theme relevance measure adopted returns web pages relevant to the blog theme the user searches for, which effectively improves search accuracy and removes web pages irrelevant to the theme.
Description of drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed in describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the structure of the web page retrieval system for blog web pages according to an embodiment of the invention;
Fig. 2 is a schematic diagram of distributed storage of the index by hash mapping with a modulo operation in an embodiment of the invention;
Fig. 3 is a schematic flow chart of the web page retrieval method for blog web pages according to an embodiment of the invention;
Fig. 4 is a schematic diagram of the principle of index construction in an embodiment of the invention;
Fig. 5 is a schematic diagram of index construction using Lucene in an embodiment of the invention;
Fig. 6 is a schematic diagram of the flow for judging the theme relevance of a page in an embodiment of the invention;
Fig. 7 is a schematic diagram of the retrieval flow in an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the invention.
Fig. 1 is a schematic diagram of the structure of the web page retrieval system for blog web pages according to an embodiment of the invention. As shown in Fig. 1, the system comprises:
an information extraction module 1, used to crawl web pages relevant to a blog theme;
a data preparation module 2, used to perform structured information extraction and web page de-duplication on the initial web pages crawled by the information extraction module 1;
an index module 3, used to build an index over the data extracted by the data preparation module 2;
a retrieval module 4, used to provide a user search interface, retrieve according to the index, and sort the retrieval results.
The system also comprises a system interface, wherein the system interface comprises an Internet web page interface 5 and a user search entry 6.
The system also comprises a web page database (not shown), used to store the downloaded web pages and the processed data. The data stored in this database comprise web page data and dictionary index data. The web page data comprise: web page number, uniform resource locator (URL), title, summary, and web page size, as shown in Table 1. The dictionary index data comprise: the words in the Chinese vocabulary, English words, and the queue of web page numbers corresponding to each word, as shown in Table 2. The web page number identifies a web page directly; it is unique and must not repeat. At query time, the web page numbers are obtained from the dictionary index, and the related data for each web page are then fetched from the web page data. In search engines this index structure is usually called an inverted index.
To obtain information about a web page quickly, a fast search mechanism is needed. Indexing is therefore done with the most popular indexing technology, Lucene, and the web pages on the content server are indexed and saved one by one. The web content server holds the index information describing the web pages; when needed, a search request is sent, the server searches the index, and the location of the web page file can be found quickly. The defined web page information structure is shown in Table 1. This structure is developed on the basis of Lucene, which stores the table as a document, much like a table in a database. From Table 1, the storage size of each web page record is 16 + 256 + 56 + 256 + 8 = 592 bytes.
Table 1: Web page information structure
Field      Description                             Length
Index      Web page number                         Char16
URL        Corresponding machine storage address   Char256
Title      Title                                   Char56
Abstract   Summary                                 Char256
Size       Web page size                           Char8
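The record size implied by Table 1 can be checked with a short sketch. It assumes that CharN means a fixed-width field of N bytes and that fields are padded with NUL bytes; the names FIELD_WIDTHS, record_size and pack_record are illustrative and do not come from the patent.

```python
# Field widths in bytes, read from Table 1 (CharN is assumed to mean N bytes).
FIELD_WIDTHS = {
    "Index": 16,
    "URL": 256,
    "Title": 56,
    "Abstract": 256,
    "Size": 8,
}


def record_size(widths):
    """Return the total size in bytes of one fixed-width record."""
    return sum(widths.values())


def pack_record(values, widths=FIELD_WIDTHS):
    """Encode each field as UTF-8, then truncate or NUL-pad it to its width.

    Truncation is byte-based, so it may cut a multi-byte character;
    a real system would need to handle that case.
    """
    parts = []
    for name, width in widths.items():
        raw = str(values.get(name, "")).encode("utf-8")[:width]
        parts.append(raw.ljust(width, b"\x00"))
    return b"".join(parts)


record = pack_record({
    "Index": "0000000000000001",
    "URL": "http://example.com/blog/1",
    "Title": "A blog post",
    "Abstract": "Short summary of the post.",
    "Size": "2048",
})
print(record_size(FIELD_WIDTHS), len(record))
```

Both printed values are 592, which matches the per-record storage size stated in the description.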
Once the web page information has been obtained, the web pages must be indexed. The content of each web page is processed with the stored word segmentation algorithm, which splits the text into as many words as possible so that an index entry can be built for every relevant word. All web page contents are arranged in descending order according to the ranking algorithm, so the web page index queue of each word is also in descending order according to the ranking algorithm. All words in the dictionary are arranged by hash distribution, so that after the query is segmented, the queue of result web page IDs for each word in the dictionary can be found quickly. The index structure adopted is shown in Table 2.
Table 2: Index structure
Field        Description                        Indexed
Key          Keyword after word segmentation    Yes
Index        Number on the web content server   No
Key_Weight   Weight within the same keyword     Yes
Title        Web page title                     No
URL          Real address of the web page       No
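A minimal sketch of the inverted index behind Table 2 is given below, assuming each page has already been segmented into keywords with a numeric weight. The function name build_index and the sample data are illustrative; a Python dict stands in for the hash-arranged dictionary, and each keyword's queue is kept in descending weight order as the description requires.

```python
from collections import defaultdict


def build_index(pages):
    """Build {keyword: [(page_id, weight), ...]} from {page_id: {keyword: weight}}.

    Each queue is sorted by weight in descending order (ties broken by page id).
    """
    index = defaultdict(list)
    for page_id, keyword_weights in pages.items():
        for keyword, weight in keyword_weights.items():
            index[keyword].append((page_id, weight))
    for queue in index.values():
        queue.sort(key=lambda entry: (-entry[1], entry[0]))
    return dict(index)


# Hypothetical pages with made-up keyword weights.
pages = {
    "p1": {"blog": 0.9, "search": 0.4},
    "p2": {"blog": 0.5},
    "p3": {"search": 0.8, "index": 0.7},
}
index = build_index(pages)
print(index["blog"])
print(index["search"])
```

Looking up a keyword is a single hash lookup, and the queue it returns is already ordered for ranking, so no sorting is needed at query time for single-keyword queries.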
With the storage structure of Table 2 and the efficiency of Lucene, sorting and searching can be implemented very conveniently. In this system, after the web crawler has fetched a web page, the resource parser analyzes the page content in real time. After the page title, page body and so on have been extracted, Chinese word segmentation is applied to part of the content, and the data are then recorded in a Document in the form of Field entries. An index can consist of multiple Documents, and a Document can contain several Fields, provided that each is given a different FieldName. When the user sends a retrieval request, the search is in fact a search over the specific Field objects contained in specific Document objects in the index.
The index storage scheme adopted by the invention is distributed. The index is stored across N machines, partly for scalability and partly to improve retrieval efficiency. The specific embodiment uses horizontal database partitioning, which assigns data of the same type to different servers. The partitioning rule is based on the search key: a hash function is applied to the key to obtain an integer, this integer is taken modulo the number of servers, and the index entry is assigned according to the remainder. An example is shown in Fig. 2. With two servers, a value is obtained by Hash(key) and then reduced modulo 2, i.e. Hash(key) % 2, so the index is distributed evenly across the two servers.
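The partitioning rule Hash(key) % N can be sketched as follows. The patent does not name a hash function; MD5 from the standard library is used here purely as an assumption, because Python's built-in hash() for strings changes between processes. The names shard_for and distribute are illustrative.

```python
import hashlib


def shard_for(key: str, num_servers: int) -> int:
    """Map a search key to a server number with Hash(key) % num_servers."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_servers


def distribute(keys, num_servers):
    """Group keys by the server that would store their index entries."""
    shards = [[] for _ in range(num_servers)]
    for key in keys:
        shards[shard_for(key, num_servers)].append(key)
    return shards


# 1000 hypothetical keywords spread over two servers, as in the Fig. 2 example.
shards = distribute([f"term{i}" for i in range(1000)], 2)
print([len(s) for s in shards])
```

A limitation of plain modulo partitioning is that changing the number of servers reassigns most keys, so adding a machine generally means redistributing the index; the description does not address this case.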
Fig. 3 is a schematic flow chart of the web page retrieval method for blog web pages according to an embodiment of the invention. As shown in Fig. 3, the method comprises:
S301: crawl web pages relevant to a blog theme;
S302: perform structured information extraction and web page de-duplication on the crawled initial web pages;
S303: build an index over the extracted data;
S304: retrieve according to the index, and sort the retrieval results.
The invention uses Lucene to build the index (S303). The principle and process are shown in Fig. 4 and Fig. 5, and are as follows:
Step 1: start by calling the addDocument method of IndexWriter. The main responsibility of IndexWriter is to add documents to the index, and it provides the main external interface for building an index; go to step 2;
Step 2: create a Document object; go to step 3;
Step 3: add and name each field Segment in the created Document object; go to step 4;
Step 4: call the addDocument method of DocumentWriter to add the document to the index; go to step 5;
Step 5: save the Segment information. If there are multiple Segments, consider whether they need to be merged; merge them if necessary, otherwise save them directly.
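The write-then-merge behaviour in steps 1 to 5 can be illustrated with a toy model. This is not Lucene's API: ToyIndexWriter, make_document and the merge_factor parameter are hypothetical names, and the merge rule (merge once a fixed number of segments has accumulated) is a simplifying assumption.

```python
class ToyIndexWriter:
    """Toy stand-in for the indexing flow in steps 1-5; not Lucene's API."""

    def __init__(self, merge_factor=3):
        self.merge_factor = merge_factor
        self.segments = []  # each segment is a list of documents

    def add_document(self, doc):
        # Step 4: write the document into the index as a new one-document segment.
        self.segments.append([doc])
        # Step 5: if enough segments have accumulated, merge them into one.
        if len(self.segments) >= self.merge_factor:
            merged = [d for segment in self.segments for d in segment]
            self.segments = [merged]


def make_document(title, url, body):
    # Steps 2-3: a document is modelled as a set of named fields.
    return {"title": title, "url": url, "body": body}


writer = ToyIndexWriter(merge_factor=3)
for name in ["a", "b", "c", "d"]:
    writer.add_document(
        make_document(name.upper(), f"http://example.com/{name}", f"post {name}")
    )
print([len(segment) for segment in writer.segments])
```

With four documents and a merge factor of 3, the first three documents end up merged into one segment and the fourth remains in its own segment until the next merge.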
After S301, the method further comprises:
filtering the crawled web pages.
Further, the step of filtering the crawled web pages comprises:
evaluating the relevance of the crawled web pages to the blog theme;
deleting the web pages with lower relevance.
To improve the accuracy of the collected pages, the relevance of each collected page to the blog theme must be evaluated; this is the page filtering technique. Deleting the pages whose evaluation result is low (that is, below the set threshold) improves the accuracy of the collected theme pages. The filtering method adopted by the system is a keyword-based vector space model algorithm. The similarity between a page and the theme can be measured by the angle between their vectors: the smaller the angle, the higher the similarity. The flow for judging the theme relevance of a page is shown in Fig. 6 and proceeds as follows:
Step 1: preprocessing stage. Before collection, extract and weight keywords from a subset of pages that describe the theme, to obtain the feature vector of the theme and the corresponding vector weights; go to step 2;
Step 2: segment the body text of the page into words and remove stop words, leaving the keywords that are needed. For keywords that belong to the theme concepts in the theme feature vector, compute their frequency with weights that depend on where they occur in the article; go to step 3;
Step 3: segment the page title into words, merge the resulting keywords with the keywords from the page body, and apply additional weight to the title keywords; go to step 4;
Step 4: adjust and expand the keywords of the page according to the feature vector of the theme; go to step 5;
Step 5: calculate the similarity sim(D, Di) between the page and the theme, where D is the theme and Di is the page to be compared; go to step 6;
Step 6: compare the value of sim(D, Di) with the threshold d. If sim(D, Di) is greater than or equal to d, the page is relevant to the theme and is kept in the theme page library; otherwise the page is deleted.
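Steps 2 to 6 can be sketched with cosine similarity, the usual angle-based measure in a vector space model. The patent does not give the formula for sim(D, Di) or the title weighting, so the cosine measure, the title_boost value of 2.0 and the function names below are assumptions made for illustration.

```python
import math


def cosine_similarity(topic, page):
    """Cosine of the angle between two sparse keyword-weight vectors (dicts)."""
    dot = sum(weight * page.get(term, 0.0) for term, weight in topic.items())
    norm_topic = math.sqrt(sum(w * w for w in topic.values()))
    norm_page = math.sqrt(sum(w * w for w in page.values()))
    if norm_topic == 0.0 or norm_page == 0.0:
        return 0.0
    return dot / (norm_topic * norm_page)


def page_vector(body_terms, title_terms, title_boost=2.0):
    """Count body keywords, then add extra weight for keywords in the title."""
    vector = {}
    for term in body_terms:
        vector[term] = vector.get(term, 0.0) + 1.0
    for term in title_terms:
        vector[term] = vector.get(term, 0.0) + title_boost
    return vector


def is_relevant(topic, page, d):
    """Keep the page when sim(D, Di) >= d, as in step 6."""
    return cosine_similarity(topic, page) >= d


topic = {"blog": 1.0, "post": 1.0}  # hypothetical theme feature vector
on_topic = page_vector(["blog", "post", "blog"], ["blog"])
off_topic = page_vector(["car", "engine"], ["car"])
print(is_relevant(topic, on_topic, 0.5), is_relevant(topic, off_topic, 0.5))
```

The inputs are assumed to be already segmented and stripped of stop words; the keyword expansion of step 4 is not modelled.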
The above algorithm yields a double-precision value that represents the theme relevance of a web page. A filtering parameter, also a double-precision value, is set in the system configuration file. If the relevance value of a collected web page is greater than this parameter, the page is judged relevant to the theme and can be indexed and stored.
The retrieval flow of the invention is described next. When the user enters a query, the input is segmented into words, the words are matched against the index database, and the corresponding web page data are located through the index and returned to the user. The flow is shown in Fig. 7 and proceeds as follows:
Step 1: segment the search terms into words; go to step 2;
Step 2: find the web page ID queue of each word by hashing; go to step 3;
Step 3: apply the logical operations AND, OR and NOT to the web page ID queues found, to obtain the final queue of result web page IDs; go to step 4;
Step 4: sort the results by weight in descending order; go to step 5;
Step 5: among the keywords similar to the segmented keywords, take those with higher weight and display the similar keywords of the first segmentation result; unrelated keywords are not displayed; go to step 6;
Step 6: perform pagination and compute the final queue of web page IDs to be displayed (Internet search pages generally show 10 results per page, so this number is at most 10). Using these web page IDs, find the stored content of the corresponding web page structures and display the search results to the user.
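The retrieval flow can be sketched over a small in-memory index. Whitespace splitting stands in for word segmentation, a page's score is assumed to be the sum of its weights for the query words, and the similar-keyword suggestion of step 5 is left out. The names INDEX and search, and the sample data, are illustrative.

```python
# Hypothetical index: word -> queue of (page_id, weight), sorted by weight.
INDEX = {
    "blog": [("p1", 0.9), ("p2", 0.5)],
    "search": [("p3", 0.8), ("p1", 0.4)],
}


def search(query, index, mode="and", exclude=(), page=1, page_size=10):
    """Return the page IDs for one result page, best match first."""
    terms = query.lower().split()                        # step 1: segmentation
    postings = [dict(index.get(t, [])) for t in terms]   # step 2: hash lookups
    if not postings:
        return []
    id_sets = [set(p) for p in postings]
    if mode == "and":                                    # step 3: AND / OR
        ids = set.intersection(*id_sets)
    elif mode == "or":
        ids = set.union(*id_sets)
    else:
        raise ValueError("mode must be 'and' or 'or'")
    for term in exclude:                                 # step 3: NOT
        ids -= {page_id for page_id, _ in index.get(term, [])}
    # Step 4: sort by summed weight, descending (ties broken by page id).
    scored = [(sum(p.get(i, 0.0) for p in postings), i) for i in ids]
    scored.sort(key=lambda item: (-item[0], item[1]))
    start = (page - 1) * page_size                       # step 6: pagination
    return [page_id for _, page_id in scored[start:start + page_size]]


print(search("blog search", INDEX))
print(search("blog search", INDEX, mode="or"))
print(search("blog", INDEX, exclude=["search"]))
```

The returned IDs would then be used to fetch the stored page records (title, URL, summary) for display.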
In the embodiments of the invention, data storage is mapped onto a plurality of servers by hash mapping with a modulo operation, which keeps the load on the storage servers well balanced; and the blog theme relevance measure adopted returns web pages relevant to the blog theme the user searches for, which effectively improves search accuracy and removes web pages irrelevant to the theme.
Those of ordinary skill in the art will appreciate that all or part of the steps in the methods of the above embodiments can be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and the like.
The web page retrieval system and method for blog web pages provided by the embodiments of the invention have been described in detail above. Specific examples are used herein to explain the principles and implementation of the invention, and the description of the above embodiments is only intended to help in understanding the method of the invention and its core idea. Meanwhile, those of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and scope of application. In summary, this description should not be construed as limiting the invention.

Claims (10)

1. A web page retrieval system for blog web pages, characterized in that the system comprises:
an information extraction module, used to crawl web pages relevant to a blog theme;
a data preparation module, used to perform structured information extraction and web page de-duplication on the initial web pages crawled by the information extraction module;
an index module, used to build an index over the data extracted by the data preparation module;
a retrieval module, used to provide a user search interface, retrieve according to the index, and sort the retrieval results.
2. The web page retrieval system for blog web pages as claimed in claim 1, characterized in that the system further comprises a web page database, used to store the downloaded web pages and the processed data.
3. The web page retrieval system for blog web pages as claimed in claim 2, characterized in that the system further comprises a system interface, wherein the system interface comprises an Internet web page interface and a user search entry.
4. The web page retrieval system for blog web pages as claimed in claim 2, characterized in that the data stored in the web page database comprise web page data and dictionary index data; wherein the web page data comprise: web page number, uniform resource locator (URL), title, summary, and web page size; and the dictionary index data comprise: the words in the Chinese vocabulary, English words, and the queue of web page numbers corresponding to each word.
5. A web page retrieval method for blog web pages, characterized in that the method comprises:
crawling web pages relevant to a blog theme;
performing structured information extraction and web page de-duplication on the crawled initial web pages;
building an index over the extracted data;
retrieving according to the index, and sorting the retrieval results.
6. The web page retrieval method for blog web pages as claimed in claim 5, characterized in that the step of building an index over the extracted data comprises:
calling the addDocument method of IndexWriter;
creating a Document object;
adding and naming each field Segment in the created Document object;
calling the addDocument method of DocumentWriter to add the document to the index;
saving the Segment information.
7. The web page retrieval method for blog web pages as claimed in claim 5, characterized in that, after the step of crawling web pages relevant to a blog theme, the method further comprises: filtering the crawled web pages.
8. The web page retrieval method for blog web pages as claimed in claim 7, characterized in that the step of filtering the crawled web pages comprises: evaluating the relevance of the crawled web pages to the blog theme; and deleting the web pages with lower relevance.
9. The web page retrieval method for blog web pages as claimed in claim 8, characterized in that the step of evaluating the relevance of the crawled web pages to the blog theme comprises:
extracting and weighting keywords from a subset of pages that describe the theme, to obtain the feature vector of the theme and the corresponding vector weights;
segmenting the body text of the page into words and removing stop words, leaving the keywords that are needed;
segmenting the page title into words, merging the resulting keywords with the keywords from the page body, and applying additional weight to the title keywords;
adjusting and expanding the keywords of the page according to the feature vector of the theme;
calculating the similarity sim(D, Di) between the page and the theme, where D is the theme and Di is the page to be compared;
comparing the value of sim(D, Di) with a threshold d: if sim(D, Di) is greater than or equal to d, the page is relevant to the theme and is kept in the theme page library; otherwise the page is deleted.
10. The web page retrieval method for blog web pages as claimed in claim 8, characterized in that the step of retrieving according to the index and sorting the retrieval results comprises:
segmenting the search terms into words;
finding the web page ID queue of each word by hashing;
applying the logical operations AND, OR and NOT to the web page ID queues found, to obtain the final queue of result web page IDs;
sorting the results by weight in descending order;
among the keywords similar to the segmented keywords, taking those with higher weight and displaying the similar keywords of the first segmentation result;
performing pagination, computing the final queue of web page IDs to be displayed, finding the stored content of the corresponding web page structures by these web page IDs, and displaying the search results to the user.
CN2013101417845A 2013-04-22 2013-04-22 Blogging webpage retrieval system and retrieval method Pending CN103218443A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013101417845A CN103218443A (en) 2013-04-22 2013-04-22 Blogging webpage retrieval system and retrieval method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013101417845A CN103218443A (en) 2013-04-22 2013-04-22 Blogging webpage retrieval system and retrieval method

Publications (1)

Publication Number Publication Date
CN103218443A true CN103218443A (en) 2013-07-24

Family

ID=48816230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013101417845A Pending CN103218443A (en) 2013-04-22 2013-04-22 Blogging webpage retrieval system and retrieval method

Country Status (1)

Country Link
CN (1) CN103218443A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559270A (en) * 2013-11-04 2014-02-05 北京中搜网络技术股份有限公司 Method for storing and managing entries
CN104091280A (en) * 2014-07-21 2014-10-08 吴晨 Intelligent network marketing system
CN104376000A (en) * 2013-08-13 2015-02-25 阿里巴巴集团控股有限公司 Webpage attribute determination method and webpage attribute determination device
CN104516917A (en) * 2013-09-30 2015-04-15 腾讯科技(北京)有限公司 Method and device for acquiring community information
CN106446060A (en) * 2016-09-06 2017-02-22 北京易游华成科技有限公司 Information push and search device, method and system
CN107229714A (en) * 2017-05-31 2017-10-03 杭州宇为科技有限公司 A kind of full-text search engine based on distributed data base
CN107729323A (en) * 2017-11-29 2018-02-23 深圳中泓在线股份有限公司 Web documents similarity detection method and device, server and storage medium
CN109101635A (en) * 2018-08-16 2018-12-28 广州小鹏汽车科技有限公司 A kind of data processing method and device based on Redis Hash structure
CN109241505A (en) * 2018-10-09 2019-01-18 北京奔影网络科技有限公司 Text De-weight method and device
WO2019041500A1 (en) * 2017-08-28 2019-03-07 平安科技(深圳)有限公司 Pagination realization method and device, computer equipment and storage medium
CN109543060A (en) * 2018-10-25 2019-03-29 深圳壹账通智能科技有限公司 Methods of exhibiting, device and storage medium, the server of vehicle picture
CN110287288A (en) * 2019-06-18 2019-09-27 北京百度网讯科技有限公司 Recommend the method and apparatus of document

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
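Liu Shuanglin: "An RSS-based blog search engine implemented with Lucene", China Master's Theses Full-text Database, Information Science and Technology Series, no. 6, 15 June 2009 (2009-06-15), pages 138 - 1120 *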

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104376000A (en) * 2013-08-13 2015-02-25 阿里巴巴集团控股有限公司 Webpage attribute determination method and webpage attribute determination device
CN104516917A (en) * 2013-09-30 2015-04-15 腾讯科技(北京)有限公司 Method and device for acquiring community information
CN104516917B (en) * 2013-09-30 2019-10-11 腾讯科技(北京)有限公司 A kind of method and device obtaining community information
CN103559270A (en) * 2013-11-04 2014-02-05 北京中搜网络技术股份有限公司 Method for storing and managing entries
CN104091280A (en) * 2014-07-21 2014-10-08 吴晨 Intelligent network marketing system
CN106446060A (en) * 2016-09-06 2017-02-22 北京易游华成科技有限公司 Information push and search device, method and system
CN107229714A (en) * 2017-05-31 2017-10-03 杭州宇为科技有限公司 A kind of full-text search engine based on distributed data base
CN107229714B (en) * 2017-05-31 2020-02-14 杭州宇为科技有限公司 Full-text search engine based on distributed database
WO2019041500A1 (en) * 2017-08-28 2019-03-07 平安科技(深圳)有限公司 Pagination realization method and device, computer equipment and storage medium
CN107729323A (en) * 2017-11-29 2018-02-23 深圳中泓在线股份有限公司 Web documents similarity detection method and device, server and storage medium
CN109101635A (en) * 2018-08-16 2018-12-28 广州小鹏汽车科技有限公司 A kind of data processing method and device based on Redis Hash structure
CN109101635B (en) * 2018-08-16 2020-09-11 广州小鹏汽车科技有限公司 Data processing method and device based on Redis Hash structure
CN109241505A (en) * 2018-10-09 2019-01-18 北京奔影网络科技有限公司 Text De-weight method and device
CN109543060A (en) * 2018-10-25 2019-03-29 深圳壹账通智能科技有限公司 Methods of exhibiting, device and storage medium, the server of vehicle picture
CN110287288A (en) * 2019-06-18 2019-09-27 北京百度网讯科技有限公司 Recommend the method and apparatus of document

Similar Documents

Publication Publication Date Title
CN103218443A (en) Blogging webpage retrieval system and retrieval method
CN104679778B (en) A kind of generation method and device of search result
CN102799647B (en) Method and device for webpage reduplication deletion
KR102080362B1 (en) Query expansion
CN101908071B (en) Method and device thereof for improving search efficiency of search engine
CN101694668B (en) Method and device for confirming web structure similarity
CN105183784B (en) Content-based spam webpage detection method and detection device thereof
JP2005085285A5 (en)
CN108319376B (en) Input association recommendation method and device for optimizing commercial word promotion
CN106503223B (en) Online house source searching method and device combining position and keyword information
CN103838798B (en) Page classifications system and page classifications method
CN104866572A (en) Method for clustering network-based short texts
CN107844565A (en) Product search method and device
CN105512143A (en) Method and device for web page classification
CN103324666A (en) Topic tracing method and device based on micro-blog data
CN102789452A (en) Similar content extraction method
CN101261629A (en) Specific information searching method based on automatic classification technology
CN110543595A (en) In-station search system and method
CN106844482B (en) Search engine-based retrieval information matching method and device
CN105512333A (en) Product comment theme searching method based on emotional tendency
CN104834736A (en) Method and device for establishing index database and retrieval method, device and system
CN102411617A (en) Method for storing and inquiring a large quantity of URLs
CN106294358A (en) The search method of a kind of information and system
CN113065070A (en) Intelligent sorting method, system, equipment and computer storage medium for mobile internet information search and retrieval
KR100671077B1 (en) Server, Method and System for Providing Information Search Service by Using Sheaf of Pages

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 2013-07-24