CN103218443A - Blogging webpage retrieval system and retrieval method - Google Patents
Blogging webpage retrieval system and retrieval method
- Publication number: CN103218443A
- Application number: CN2013101417845A
- Authority: CN (China)
- Prior art keywords: webpage, page, web, blog, index
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a web page retrieval system and retrieval method for blogs. The system comprises an information extraction module, a data reduction module, an indexing module and a retrieval module. The information extraction module crawls web pages relevant to a blog topic; the data reduction module performs structured information extraction and de-duplication on the initial web pages crawled by the information extraction module; the indexing module builds an index over the data extracted by the data reduction module; and the retrieval module provides a search interface to the user, retrieves according to the index, and sorts the retrieval results. Data storage is mapped across a plurality of servers by taking a hash of the key modulo the number of servers, which helps balance the load across the storage servers. The blog topic relevance measure returns web pages relevant to the blog topic the user searches for, which improves search accuracy and eliminates web pages irrelevant to the topic.
Description
Technical field
The present invention relates to the technical field of web search, and in particular to a web search system and method for blog web pages.
Background technology
In the past few years, Internet search engines have achieved great success, and Google, which built its business on search, has been handsomely rewarded: its advertising revenue exceeds 100 million U.S. dollars per day. China's domestic search engine market has also become lively because of the fierce competition between 360 and Baidu, and more and more companies are entering the search engine contest because a search engine, like a browser, is an entrance to the Internet. However, each company keeps its core technology strictly confidential, so outsiders cannot know how it is implemented. The performance of current search engines also varies, each having strengths in different environments.
At present, traditional Internet search engines do not serve mobile environments well, and in specialized fields general-purpose search engines such as Baidu and Google are not the best choice; there is still considerable room to improve search accuracy. In particular, no search engine has yet been developed specifically for blog systems, and searching and re-ranking web pages related to a blog topic remain unsatisfactory.
Current search engines cannot retrieve and re-rank by topic according to the characteristics of blog systems. Instead, they treat blog pages like ordinary web pages and crawl them by following URL links to a controlled depth, so some pages irrelevant to the topic are also presented to the user. Moreover, they measure page relevance only by term frequency or by how well single words match, which does not truly reflect the blog topic.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art. The invention provides a web search system and method for blog web pages that returns web pages relevant to the blog topic the user searches for, improves search accuracy, and removes pages irrelevant to the topic.
To solve the above problems, the present invention proposes a web search system for blog web pages, the system comprising:
An information extraction module, used to crawl web pages relevant to a blog topic;
A data reduction module, used to perform structured information extraction and web page de-duplication on the initial web pages crawled by the information extraction module;
An index module, used to build an index over the data extracted by the data reduction module;
A retrieval module, used to provide a user search interface, retrieve according to the index, and sort the retrieval results.
Preferably, the system further comprises a web page database, used to store the downloaded web pages and the processed data.
Preferably, the system further comprises a system interface, which includes an Internet web page interface and a user search entrance.
Preferably, the data stored in the web page database comprise web page data and dictionary index data. The web page data comprise: page number, uniform resource locator (URL), title, summary, and page size. The dictionary index data comprise: the terms in the Chinese lexicon, English words, and the queue of page numbers corresponding to each term.
Correspondingly, an embodiment of the invention also discloses a web search method for blog web pages, the method comprising:
Crawling web pages relevant to a blog topic;
Performing structured information extraction and web page de-duplication on the initial web pages crawled;
Building an index over the extracted data;
Retrieving according to the index, and sorting the retrieval results.
Preferably, the step of building an index over the extracted data comprises:
Calling the addDocument method of IndexWriter;
Creating a Document object;
Adding and naming each field Segment in the created Document object;
Calling the addDocument method of DocumentWriter to add the document to the index;
Saving the Segment information.
Preferably, after the step of crawling web pages relevant to a blog topic, the method further comprises: filtering the crawled web pages.
Preferably, the step of filtering the crawled web pages comprises: evaluating the relevance of the crawled web pages to the blog topic; and deleting web pages with low relevance.
Preferably, the step of evaluating the relevance of the crawled web pages to the blog topic comprises:
Extracting and weighting keywords from a subset of pages that describe the topic, to obtain the topic feature vector and the corresponding weights of the vector;
Segmenting the body text of the page into words, removing stop words, and keeping the needed keywords;
Segmenting the page title, merging the resulting keywords with the keywords from the page body, and adding extra weight to the keywords obtained from the title;
Adjusting and expanding the keywords of the page according to the topic feature vector;
Calculating the similarity sim(D, Di) between the page and the topic, where D is the topic and Di is the page to be compared;
Comparing the value of sim(D, Di) with a threshold d: if sim(D, Di) is greater than or equal to d, the page is relevant to the topic and is kept in the topic page library; otherwise the page is deleted.
Preferably, the step of retrieving according to the index and sorting the retrieval results comprises:
Segmenting the search terms into words;
Finding the page ID queue of each term by hash;
Applying the logical operations AND, OR and NOT to the page ID queues found, to obtain the final queue of result page IDs;
Sorting the results by weight in descending order;
Among keywords with a high similarity weight to the segmented keywords, displaying those similar to the first segmentation result;
Performing paginated display: computing the queue of page IDs to be shown, finding the stored web page structure contents by these page IDs, and displaying the search results to the user.
In embodiments of the present invention, data storage is mapped across a plurality of servers by hash-modulo mapping, which helps balance load among the storage servers; and the blog topic relevance measure returns pages relevant to the blog topic the user searches for, which improves search accuracy and removes pages irrelevant to the topic.
Description of drawings
To illustrate the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the structure of the web search system for blog web pages according to an embodiment of the invention;
Fig. 2 is a schematic diagram of distributed index storage using hash-modulo mapping in an embodiment of the invention;
Fig. 3 is a schematic flowchart of the web search method for blog web pages according to an embodiment of the invention;
Fig. 4 is a schematic diagram of the index construction principle in an embodiment of the invention;
Fig. 5 is a schematic diagram of index construction using Lucene in an embodiment of the invention;
Fig. 6 is a schematic diagram of the topic relevance judgment flow for a page in an embodiment of the invention;
Fig. 7 is a schematic diagram of the retrieval flow in an embodiment of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the invention.
Fig. 1 is a schematic diagram of the structure of the web search system for blog web pages according to an embodiment of the invention. As shown in Fig. 1, the system comprises:
Information extraction module 1, used to crawl web pages relevant to a blog topic;
Data reduction module 2, used to perform structured information extraction and web page de-duplication on the initial web pages crawled by information extraction module 1;
Index module 3, used to build an index over the data extracted by data reduction module 2;
Retrieval module 4, used to provide a user search interface, retrieve according to the index, and sort the retrieval results.
The system also comprises a system interface, which includes Internet web page interface 5 and user search entrance 6.
The system also comprises a web page database (not shown), used to store the downloaded web pages and the processed data. The data stored in this database comprise web page data and dictionary index data. The web page data comprise the page number, uniform resource locator (URL), title, summary, and page size, as shown in Table 1. The dictionary index data comprise the terms in the Chinese lexicon, English words, and the queue of page numbers corresponding to each term, as shown in Table 2. The page number identifies a page directly; it is a unique number and must not repeat. At query time, page numbers are obtained from the dictionary index, and the data for each page are then fetched from the web page data. In search engines this index structure is usually called an inverted index.
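The two-stage lookup described above (term to page numbers, then page numbers to page data) can be sketched as follows. This is a minimal in-memory illustration, not the patented storage layout; the dictionaries `pages` and `dictionary_index`, the function `lookup`, and the sample URLs are all assumptions made for the example.

```python
# Web page data: page number -> page record (fields follow the description above).
pages = {
    1: {"url": "http://example.com/a", "title": "Blog search basics", "size": 1024},
    2: {"url": "http://example.com/b", "title": "Building an index", "size": 2048},
}

# Dictionary index: term -> queue of page numbers containing that term.
dictionary_index = {
    "blog": [1, 2],
    "search": [1],
    "index": [2],
}


def lookup(term):
    """Resolve a term to page records: term -> page numbers -> page data."""
    return [pages[number] for number in dictionary_index.get(term, [])]


titles = [record["title"] for record in lookup("blog")]
print(titles)
```

Because the dictionary maps terms to pages rather than pages to terms, a query never has to scan page contents, which is the point of an inverted index.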
To obtain information about a page quickly, a fast lookup mechanism is needed, so the pages on the content server are indexed and stored one by one using Lucene, a widely used indexing technology. The web content server holds index information describing each page; when a search request is sent, the server searches the index and quickly locates the page file. The defined page information structure is shown in Table 1. This structure is built on Lucene, and the table is stored as a document, much like a table in a database. With the field lengths in Table 1, each page record occupies 16 + 256 + 56 + 256 + 8 = 592 bytes.
Table 1: Web page information structure
Field | Description | Length |
---|---|---|
Index | Page number | Char16 |
URL | Storage address on the corresponding machine | Char256 |
Title | Title | Char56 |
Abstract | Summary | Char256 |
Size | Page size | Char8 |
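The 592-byte record size follows directly from the field widths in Table 1. The sketch below checks that arithmetic with a fixed-width binary layout; treating each `CharN` as N raw bytes and encoding text as UTF-8 are assumptions for illustration, and `PAGE_RECORD` and `pack_page` are hypothetical names.

```python
import struct

# Fixed-width layout mirroring Table 1: Index, URL, Title, Abstract, Size.
# "<" selects standard sizes with no padding between fields.
PAGE_RECORD = struct.Struct("<16s256s56s256s8s")


def pack_page(index, url, title, abstract, size):
    """Pack one page record; short values are NUL-padded, long ones truncated."""
    fields = (index, url, title, abstract, size)
    return PAGE_RECORD.pack(*(value.encode("utf-8") for value in fields))


record = pack_page(
    "0000000000000001",
    "http://example.com/post/1",
    "A blog post",
    "Short summary of the post",
    "2048",
)
print(PAGE_RECORD.size, len(record))  # 592 592
```

Note that truncation happens at the byte level, so a real implementation storing Chinese titles would need to avoid cutting a multi-byte character in half.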
Once the page information has been obtained, the pages are indexed. The stored word segmentation algorithm is applied to the content of each page, using maximum segmentation so that an index entry can be built for every relevant term. All page contents are arranged in descending order by the ranking algorithm, so the page index queue of each term is also in descending order. All terms in the dictionary are arranged by hash, so that after a query is segmented, the queue of result page IDs for each dictionary term can be found quickly. The index structure adopted is shown in Table 2.
Table 2: Index structure
Field | Description | Indexed |
---|---|---|
Key | Keyword after word segmentation | Yes |
Index | Number of the web content server | No |
Key_Weight | Weight within the same keyword | Yes |
Title | Web page title | No |
URL | Actual address of the web page | No |
With the storage structure of Table 2 and the efficiency of Lucene, sorting and searching are easy to implement. In this system, after the web crawler fetches a page, the resource parser analyzes the page content in real time. After extracting the page title, body text, and so on, it performs Chinese word segmentation on part of the content and then records the data in a Document as Field objects. An index can consist of multiple Documents, and a Document can contain several Fields; each Field only needs to be given a distinct FieldName. When the user sends a retrieval request, the search is in effect a lookup in the index of specific Field objects contained in specific Document objects.
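As a rough illustration of the Key and Key_Weight columns of Table 2, the sketch below builds one queue per keyword and keeps each queue in descending weight order, as the text describes. It is plain Python rather than Lucene, and the function name `build_postings`, the input shape, and the sample weights are assumptions.

```python
from collections import defaultdict


def build_postings(documents):
    """Build keyword -> [(page_number, weight), ...] sorted by weight, descending.

    documents: iterable of (page_number, {keyword: weight}) pairs, where the
    keywords are assumed to come from an earlier word segmentation step.
    """
    postings = defaultdict(list)
    for page_number, keyword_weights in documents:
        for keyword, weight in keyword_weights.items():
            postings[keyword].append((page_number, weight))
    for entries in postings.values():
        entries.sort(key=lambda entry: entry[1], reverse=True)
    return dict(postings)


docs = [
    (1, {"blog": 0.4, "search": 0.9}),
    (2, {"blog": 0.7}),
]
postings = build_postings(docs)
print(postings["blog"])  # [(2, 0.7), (1, 0.4)]
```

Keeping each queue pre-sorted means the best-weighted pages for a single keyword are available without sorting at query time.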
The index storage scheme adopted by the invention is distributed: the search index is stored across N machines, on the one hand for scalability and on the other to improve retrieval efficiency. The specific embodiment uses horizontal partitioning of the database, meaning that data of the same type are assigned to different servers. The partitioning rule is based on the search key: a hash function is applied to the key to obtain an integer, that integer is taken modulo the number of servers, and the index entry is assigned according to the remainder. An example is shown in Fig. 2: with two servers, a value is obtained through Hash(key) and then reduced modulo 2, i.e. Hash(key) % 2, so that the index is distributed evenly across the two servers.
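A minimal sketch of the Hash(key) % N rule follows. The source does not say which hash function is used; MD5 is chosen here only because it gives the same value in every process, whereas Python's built-in `hash()` is randomized per process for strings. The function name `shard_for` is hypothetical.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_servers: int) -> int:
    """Map a search key to a server number in [0, num_servers) via hash modulo."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


# Distribute 1000 sample keys over two servers, as in the Fig. 2 example.
counts = Counter(shard_for(f"term{i}", 2) for i in range(1000))
print(sorted(counts.items()))
```

One consequence of plain modulo placement is that changing the number of servers changes the remainder for most keys, so adding a server generally means redistributing the index.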
Fig. 3 is a schematic flowchart of the web search method for blog web pages according to an embodiment of the invention. As shown in Fig. 3, the method comprises:
S301: crawl web pages relevant to a blog topic;
S302: perform structured information extraction and web page de-duplication on the initial web pages crawled;
S303: build an index over the extracted data;
S304: retrieve according to the index, and sort the retrieval results.
The invention uses Lucene to build the index (S303); the principle and process are shown in Fig. 4 and Fig. 5, and are as follows:
Step 1: start by calling the addDocument method of IndexWriter. The main responsibility of IndexWriter is to add documents to the index, and it provides the main external interface for building an index; go to step 2;
Step 2: create a Document object.
Step 3: add and name each field Segment in the created Document object; go to step 4;
Step 4: call the addDocument method of DocumentWriter to add the document to the index;
Step 5: save the Segment information. If there are multiple Segments, consider whether they should be merged: merge them if needed, otherwise save them directly.
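The add-then-merge flow in the steps above can be mimicked with a small in-memory model. This is a toy written in Python for illustration only: it is not Lucene's API, and `ToyIndexWriter`, its `merge_factor` parameter, and the whitespace tokenizer are all hypothetical stand-ins.

```python
class ToyIndexWriter:
    """In-memory stand-in for the indexing flow above (not Lucene's API)."""

    def __init__(self, merge_factor=3):
        self.merge_factor = merge_factor  # merge once this many segments exist
        self.segments = []  # each segment maps "field:token" -> list of doc ids

    def add_document(self, doc_id, fields):
        """Index one document (a dict of field name -> text) as a new segment."""
        segment = {}
        for name, text in fields.items():
            for token in text.lower().split():
                ids = segment.setdefault(f"{name}:{token}", [])
                if doc_id not in ids:
                    ids.append(doc_id)
        self.segments.append(segment)
        if len(self.segments) >= self.merge_factor:
            self._merge()

    def _merge(self):
        """Combine all segments into one, keeping doc ids sorted and unique."""
        merged = {}
        for segment in self.segments:
            for term, ids in segment.items():
                merged.setdefault(term, set()).update(ids)
        self.segments = [{term: sorted(ids) for term, ids in merged.items()}]


writer = ToyIndexWriter(merge_factor=2)
writer.add_document(1, {"title": "blog search"})
writer.add_document(2, {"title": "blog index"})
print(writer.segments[0]["title:blog"])  # [1, 2]
```

The model keeps only the two ideas named in the steps: each added document produces segment data, and multiple segments are merged when a policy says so. Everything else about a real index writer is omitted.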
After S301, the method further comprises:
Filtering the crawled web pages.
Further, the step of filtering the crawled web pages comprises:
Evaluating the relevance of the crawled web pages to the blog topic;
Deleting web pages with low relevance.
To improve the topical accuracy of the collected pages, the relevance of each collected page to the blog topic must be evaluated; this is page filtering. Pages whose evaluation result is low (that is, below a preset threshold) are deleted, which improves the accuracy of the collected topic pages. The filtering method adopted by this system is a keyword-based vector space model algorithm. The similarity between a page and the topic can be measured by the angle between their vectors: the smaller the angle, the higher the similarity. The topic relevance judgment flow is shown in Fig. 6 and proceeds as follows:
Step 1 (preprocessing): before collection, extract and weight keywords from a subset of pages that describe the topic, obtaining the topic feature vector and the corresponding weights of the vector; go to step 2;
Step 2: segment the body text of the page into words, remove stop words, and keep the needed keywords. For keywords that express topic concepts in the topic feature vector, compute their frequency according to the positions at which they occur in the article, weighting the frequencies as needed; go to step 3;
Step 3: segment the page title, merge the resulting keywords with the keywords from the page body, and add extra weight to the keywords obtained from the title; go to step 4;
Step 4: adjust and expand the keywords of the page according to the topic feature vector; go to step 5;
Step 5: calculate the similarity sim(D, Di) between the page and the topic, where D is the topic and Di is the page to be compared; go to step 6;
Step 6: compare the value of sim(D, Di) with a threshold d. If sim(D, Di) is greater than or equal to d, the page is relevant to the topic and is kept in the topic page library; otherwise the page is deleted.
The algorithm above yields a double-precision value representing the topic relevance of a page. A filtering parameter, also a double-precision value, is set in the system configuration file; if the relevance value of a collected page is greater than this parameter, the page is deemed relevant to the topic and can be indexed and stored.
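The source does not give a formula for sim(D, Di). Since it measures similarity by the angle between vectors, the sketch below assumes the usual cosine measure. Word segmentation is assumed to have been done already, step 4 (keyword expansion) is omitted, and the names `page_vector`, `title_boost`, and `is_relevant` as well as the boost value are hypothetical.

```python
import math


def page_vector(body_terms, title_terms, title_boost=2.0, stop_words=frozenset()):
    """Build a keyword -> weight vector; title keywords get extra weight."""
    vector = {}
    for term in body_terms:
        if term not in stop_words:
            vector[term] = vector.get(term, 0.0) + 1.0
    for term in title_terms:
        if term not in stop_words:
            vector[term] = vector.get(term, 0.0) + title_boost
    return vector


def cosine_similarity(topic, page):
    """Cosine of the angle between two sparse keyword vectors."""
    dot = sum(weight * page.get(term, 0.0) for term, weight in topic.items())
    norm_topic = math.sqrt(sum(w * w for w in topic.values()))
    norm_page = math.sqrt(sum(w * w for w in page.values()))
    if norm_topic == 0.0 or norm_page == 0.0:
        return 0.0
    return dot / (norm_topic * norm_page)


def is_relevant(topic, page, threshold):
    """Keep the page when sim(D, Di) >= d, following step 6."""
    return cosine_similarity(topic, page) >= threshold


topic = {"blog": 1.0, "search": 1.0}
page = page_vector(["blog", "search", "the"], ["blog"], stop_words={"the"})
print(round(cosine_similarity(topic, page), 3), is_relevant(topic, page, 0.5))
```

For the sample page the vector is blog = 3.0 and search = 1.0, so the similarity is 4 / (√2 · √10) ≈ 0.894, which passes a threshold of 0.5.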
The retrieval flow of the invention is described next. When the user enters a query, the input is segmented into words and matched against the index database, and the corresponding page data are then located through the index and returned to the user. The flow is shown in Fig. 7 and is as follows:
Step 1: segment the search terms into words; go to step 2;
Step 2: find the page ID queue of each term by hash; go to step 3;
Step 3: apply the logical operations AND, OR and NOT to the page ID queues found, obtaining the final queue of result page IDs; go to step 4;
Step 4: sort the results by weight in descending order; go to step 5;
Step 5: among keywords with a high similarity weight to the segmented keywords, display those similar to the first segmentation result; unrelated ones are not displayed; go to step 6;
Step 6: perform paginated display: compute the queue of page IDs to be shown (Internet search pages generally show 10 results per page, so this number is at most 10), find the stored web page structure contents by these page IDs, and display the search results to the user.
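Steps 2, 3, 4 and 6 of the retrieval flow can be sketched with set operations over page ID queues. The query is assumed to be segmented already, step 5 (similar keyword display) is left out, and the function `search`, its parameters, and the sample data are assumptions made for the example.

```python
def search(query_terms, postings, weights, exclude_terms=(), mode="and",
           page=1, page_size=10):
    """Return one page of result page IDs, best weight first.

    postings: term -> list of page IDs (a Python dict is itself a hash lookup).
    weights:  page ID -> ranking weight.
    """
    id_sets = [set(postings.get(term, [])) for term in query_terms]
    if not id_sets:
        return []
    # AND keeps pages containing every term; OR keeps pages containing any term.
    hits = set.intersection(*id_sets) if mode == "and" else set.union(*id_sets)
    # NOT removes pages containing an excluded term.
    for term in exclude_terms:
        hits -= set(postings.get(term, []))
    # Sort by weight, descending; ties are broken by page ID for stable output.
    ranked = sorted(hits, key=lambda pid: (-weights.get(pid, 0.0), pid))
    start = (page - 1) * page_size
    return ranked[start:start + page_size]


postings = {"blog": [1, 2, 3], "search": [2, 3, 4], "spam": [3]}
weights = {1: 0.2, 2: 0.9, 3: 0.5, 4: 0.7}
print(search(["blog", "search"], postings, weights))             # [2, 3]
print(search(["blog", "search"], postings, weights, mode="or"))  # [2, 4, 3, 1]
```

With `page_size=10` each call returns at most ten IDs, matching the per-page limit mentioned in step 6; the caller would then fetch the stored page records for those IDs.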
In embodiments of the present invention, data storage is mapped across a plurality of servers by hash-modulo mapping, which helps balance load among the storage servers; and the blog topic relevance measure returns pages relevant to the blog topic the user searches for, which improves search accuracy and removes pages irrelevant to the topic.
A person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, and the like.
The web search system and method for blog web pages provided by the embodiments of the invention have been described in detail above. Specific examples have been used to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help in understanding the method and its core idea. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and scope of application. In summary, this description should not be construed as limiting the invention.
Claims (10)
1. A web search system for blog web pages, characterized in that the system comprises:
An information extraction module, used to crawl web pages relevant to a blog topic;
A data reduction module, used to perform structured information extraction and web page de-duplication on the initial web pages crawled by the information extraction module;
An index module, used to build an index over the data extracted by the data reduction module;
A retrieval module, used to provide a user search interface, retrieve according to the index, and sort the retrieval results.
2. The web search system for blog web pages as claimed in claim 1, characterized in that the system further comprises a web page database, used to store the downloaded web pages and the processed data.
3. The web search system for blog web pages as claimed in claim 2, characterized in that the system further comprises a system interface, wherein the system interface includes an Internet web page interface and a user search entrance.
4. The web search system for blog web pages as claimed in claim 2, characterized in that the data stored in the web page database comprise web page data and dictionary index data; wherein the web page data comprise: page number, uniform resource locator (URL), title, summary, and page size; and the dictionary index data comprise: the terms in the Chinese lexicon, English words, and the queue of page numbers corresponding to each term.
5. A web search method for blog web pages, characterized in that the method comprises:
Crawling web pages relevant to a blog topic;
Performing structured information extraction and web page de-duplication on the initial web pages crawled;
Building an index over the extracted data;
Retrieving according to the index, and sorting the retrieval results.
6. The web search method for blog web pages as claimed in claim 5, characterized in that the step of building an index over the extracted data comprises:
Calling the addDocument method of IndexWriter;
Creating a Document object;
Adding and naming each field Segment in the created Document object;
Calling the addDocument method of DocumentWriter to add the document to the index;
Saving the Segment information.
7. The web search method for blog web pages as claimed in claim 5, characterized in that, after the step of crawling web pages relevant to a blog topic, the method further comprises: filtering the crawled web pages.
8. The web search method for blog web pages as claimed in claim 7, characterized in that the step of filtering the crawled web pages comprises: evaluating the relevance of the crawled web pages to the blog topic; and deleting web pages with low relevance.
9. The web search method for blog web pages as claimed in claim 8, characterized in that the step of evaluating the relevance of the crawled web pages to the blog topic comprises:
Extracting and weighting keywords from a subset of pages that describe the topic, to obtain the topic feature vector and the corresponding weights of the vector;
Segmenting the body text of the page into words, removing stop words, and keeping the needed keywords;
Segmenting the page title, merging the resulting keywords with the keywords from the page body, and adding extra weight to the keywords obtained from the title;
Adjusting and expanding the keywords of the page according to the topic feature vector;
Calculating the similarity sim(D, Di) between the page and the topic, where D is the topic and Di is the page to be compared;
Comparing the value of sim(D, Di) with a threshold d: if sim(D, Di) is greater than or equal to d, the page is relevant to the topic and is kept in the topic page library; otherwise the page is deleted.
10. The web search method for blog web pages as claimed in claim 8, characterized in that the step of retrieving according to the index and sorting the retrieval results comprises:
Segmenting the search terms into words;
Finding the page ID queue of each term by hash;
Applying the logical operations AND, OR and NOT to the page ID queues found, to obtain the final queue of result page IDs;
Sorting the results by weight in descending order;
Among keywords with a high similarity weight to the segmented keywords, displaying those similar to the first segmentation result;
Performing paginated display: computing the queue of page IDs to be shown, finding the stored web page structure contents by these page IDs, and displaying the search results to the user.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013101417845A CN103218443A (en) | 2013-04-22 | 2013-04-22 | Blogging webpage retrieval system and retrieval method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103218443A true CN103218443A (en) | 2013-07-24 |
Family
ID=48816230
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2013101417845A Pending CN103218443A (en) | 2013-04-22 | 2013-04-22 | Blogging webpage retrieval system and retrieval method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103218443A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101000632A (en) * | 2007-01-11 | 2007-07-18 | 上海交通大学 | Blog search and browsing system of intention driven |
CN101127046A (en) * | 2007-09-25 | 2008-02-20 | 腾讯科技(深圳)有限公司 | Method and system for sequencing to blog article |
US20110295844A1 (en) * | 2010-05-27 | 2011-12-01 | Microsoft Corporation | Enhancing freshness of search results |
Non-Patent Citations (1)
Title |
---|
Liu Shuanglin: "An RSS-based blog search engine implemented with Lucene", China Master's Theses Full-text Database, Information Science and Technology, no. 6, 15 June 2009 (2009-06-15), pages 138 - 1120 * |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104376000A (en) * | 2013-08-13 | 2015-02-25 | 阿里巴巴集团控股有限公司 | Webpage attribute determination method and webpage attribute determination device |
CN104516917A (en) * | 2013-09-30 | 2015-04-15 | 腾讯科技(北京)有限公司 | Method and device for acquiring community information |
CN104516917B (en) * | 2013-09-30 | 2019-10-11 | 腾讯科技(北京)有限公司 | A kind of method and device obtaining community information |
CN103559270A (en) * | 2013-11-04 | 2014-02-05 | 北京中搜网络技术股份有限公司 | Method for storing and managing entries |
CN104091280A (en) * | 2014-07-21 | 2014-10-08 | 吴晨 | Intelligent network marketing system |
CN106446060A (en) * | 2016-09-06 | 2017-02-22 | 北京易游华成科技有限公司 | Information push and search device, method and system |
CN107229714A (en) * | 2017-05-31 | 2017-10-03 | 杭州宇为科技有限公司 | A kind of full-text search engine based on distributed data base |
CN107229714B (en) * | 2017-05-31 | 2020-02-14 | 杭州宇为科技有限公司 | Full-text search engine based on distributed database |
WO2019041500A1 (en) * | 2017-08-28 | 2019-03-07 | 平安科技(深圳)有限公司 | Pagination realization method and device, computer equipment and storage medium |
CN107729323A (en) * | 2017-11-29 | 2018-02-23 | 深圳中泓在线股份有限公司 | Web documents similarity detection method and device, server and storage medium |
CN109101635A (en) * | 2018-08-16 | 2018-12-28 | 广州小鹏汽车科技有限公司 | A kind of data processing method and device based on Redis Hash structure |
CN109101635B (en) * | 2018-08-16 | 2020-09-11 | 广州小鹏汽车科技有限公司 | Data processing method and device based on Redis Hash structure |
CN109241505A (en) * | 2018-10-09 | 2019-01-18 | 北京奔影网络科技有限公司 | Text De-weight method and device |
CN109543060A (en) * | 2018-10-25 | 2019-03-29 | 深圳壹账通智能科技有限公司 | Methods of exhibiting, device and storage medium, the server of vehicle picture |
CN110287288A (en) * | 2019-06-18 | 2019-09-27 | 北京百度网讯科技有限公司 | Recommend the method and apparatus of document |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20130724 |