CN100478962C - Method, device and system for searching web page and device for establishing index database - Google Patents


Info

Publication number
CN100478962C
Authority
CN
China
Prior art keywords
forum
clue
sign
information
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CNB200710136345XA
Other languages
Chinese (zh)
Other versions
CN101101605A (en
Inventor
王伟
李自军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CNB200710136345XA priority Critical patent/CN100478962C/en
Publication of CN101101605A publication Critical patent/CN101101605A/en
Application granted granted Critical
Publication of CN100478962C publication Critical patent/CN100478962C/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention indexes and analyzes forum web pages using the forum thread as the unit. The method comprises: obtaining a user query word; searching a preset index database for the forum threads corresponding to the user query word; formatting the threads found and outputting the formatted forum threads. The invention also discloses corresponding devices and a system for searching web pages, as well as a device for building the index database. Based on the user's query word, the invention returns the forum threads corresponding to that query word, so the user obtains query results in units of forum threads rather than in units of forum web pages, as in traditional search. The results returned to the user are therefore more accurate.

Description

Method, device and system for searching web pages, and device for establishing an index database
Technical field
The present invention relates to the field of network technology, and in particular to a method, device and system for searching web pages and to a device for establishing an index database.
Background art
With the rapid development of information retrieval, document retrieval technology has reached a relatively mature stage. It has progressed from the original keyword matching to today's analysis based on context, pattern matching, example matching and applied statistical strategies, forming a fairly complete body of ideas and well-developed algorithms that are widely used in all kinds of search engines.
An existing method of providing web page search to users works as follows. First, a web page collector fetches pages from the Internet using a crawling program such as a web spider and stores them in an original web page database. The collector extracts uniform resource locators (URL: Uniform Resource Locator) from the pages and passes them to a collection controller for evaluation. The collection controller takes these URLs and directs the spider to fetch further pages, and the cycle repeats until all pages have been fetched.
The system obtains text from the original web page database, preprocesses each single web page, and feeds it to a "text indexer" module that builds the index and forms an index database. At the same time, link information is extracted and sent to a link analysis module, which rates the pages and forms a link rating database. The link information includes the anchor text, the link itself, and similar data.
The user submits a query request to a query server. The query server looks up relevant web pages in the index database, while the link rating database combines the query request with the link information to evaluate the relevance of the search results. The query server sorts the results by relevance, extracts a summary around the keywords, and finally formats the results through the user interface and returns them to the user.
As can be seen from the above, the prior art indexes and analyzes content in units of single web pages. This yields good search results for pages whose subject matter is clear and concentrated, such as news pages. Forum pages are different: a single page contains the discussion of many users, each contribution is relatively short, each page holds the content of one or more posts, and a forum thread (Thread) may be spread across one or more pages. For such pages, indexing and analyzing in units of single web pages makes it hard to obtain good search results.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a method, device and system for searching web pages and a device for establishing an index database. With the technical solutions provided by the embodiments, forum web pages can be indexed and analyzed in units of forum threads.
This purpose is achieved through the following technical solutions:
A method for searching web pages, comprising:
obtaining a user query word;
searching a preset index database for the forum thread corresponding to the user query word;
formatting the forum thread found, and outputting the formatted forum thread.
A device for establishing a forum thread database, comprising:
an original web page acquiring unit, configured to obtain unprocessed original web pages;
a forum thread template recognition unit, configured to use a preset forum thread template library to identify the forum thread template corresponding to the original web page;
an information extraction unit, configured to extract from the original web page the information identified by the forum thread template, the information including a forum identifier;
an information saving unit, configured to save the information in the entry of the forum thread database that corresponds to the forum identifier.
A device for establishing an index database, comprising:
a forum thread acquiring unit, configured to obtain from the forum thread database the forum thread corresponding to a forum thread identifier;
a keyword set acquiring unit, configured to preprocess the forum thread and obtain the keyword set that represents the forum thread;
an information saving unit, configured to save the forum thread and the keyword set, in correspondence with each other, to the index database.
A device for searching web pages, comprising:
a user query word acquiring unit, configured to obtain a user query word;
a forum thread lookup unit, configured to search the index database for the forum thread corresponding to the user query word;
a forum thread output unit, configured to format the forum thread found and output the formatted forum thread to the user.
A system for searching web pages, comprising:
a device for establishing a forum thread database, configured to: obtain unprocessed original web pages; use a preset forum thread template library to identify the forum thread template corresponding to the original web page; extract from the original web page the information identified by the forum thread template, the information including a forum identifier; and save the information in the entry of the forum thread database that corresponds to the forum identifier;
a device for establishing an index database, configured to: obtain from the forum thread database the forum thread corresponding to a forum thread identifier; preprocess the forum thread to obtain the keyword set that represents it; and save the forum thread and the keyword set, in correspondence with each other, to the index database;
a device for searching web pages, configured to: obtain a user query word; search the index database for the forum thread corresponding to the user query word; format the forum thread found; and output the formatted forum thread.
As can be seen from the technical solutions above, the embodiments of the present invention return to the user the forum threads corresponding to the user's query word. The user thus obtains query results in units of forum threads instead of the traditional results in units of forum web pages, which makes the results returned to the user more accurate.
Brief description of the drawings
Fig. 1 is a structural diagram of embodiment one of the device for establishing a forum thread database in an embodiment of the present invention;
Fig. 2 is a structural diagram of embodiment two of the device for establishing a forum thread database in an embodiment of the present invention;
Fig. 3 is a structural diagram of the device for establishing an index database in an embodiment of the present invention;
Fig. 4 is a flowchart of method embodiment one for searching web pages in an embodiment of the present invention;
Fig. 5 is a flowchart of method embodiment two for searching web pages in an embodiment of the present invention;
Fig. 6 is a flowchart of method embodiment three for searching web pages in an embodiment of the present invention;
Fig. 7 is a structural diagram of the device embodiment for searching web pages in an embodiment of the present invention;
Fig. 8 is a structural diagram of the system embodiment for searching web pages in an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions and advantages of the present invention clearer, the invention is described in more detail below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, the device 10 for establishing a forum thread database provided by an embodiment of the present invention comprises:
An original web page acquiring unit 101, configured to obtain unprocessed original web pages.
An original web page is a page fetched from the network that has not yet been processed. Original web pages are obtained as in the prior art: a web page collector 11 traverses the web with a crawling program such as a web spider and stores the fetched pages in an original web page database 13, where the crawling performed by the web page collector is controlled by a collection controller 12;
When an original web page is needed, it can therefore be read directly from the original web page database.
A forum thread template recognition unit 102, configured to use the preset forum thread template library 14 to identify the forum thread template corresponding to the original web page.
This embodiment describes only the case in which the forum thread template for the original web page can be identified. In practice, identification may fail, and the page then needs corresponding handling: it can be discarded directly, for example, or analyzed to derive its forum thread template, which is then saved to the forum thread template library 14. Because every original web page has its own structural characteristics, each has a uniquely corresponding forum thread template.
The forum thread template library stores predefined forum thread templates. One possible entry format for a forum thread template is shown in Table 1:
Table 1. Forum thread template table
Forum identifier | URL | Original forum thread identifier extraction pattern | Forum thread paging extraction pattern | Post content extraction pattern | ......
Forum1 | http://bbs.test01.com/ | read.php?tid=??&fpage=0&toread=&page=×× | read.php?tid=××&fpage=0&toread=&page=?? | ××× | ......
Forum2 | http://bbs.test02.com/ | ??/ShowPost.aspx?PageIndex=×× | ××/ShowPost.aspx?PageIndex=?? | ××× | ......
...... | ...... | ...... | ...... | ...... | ......
As shown in Table 1, the forum thread template table stores the forum identifier, the URL, the original forum thread identifier extraction pattern, the forum thread paging extraction pattern, the post content extraction pattern and similar information. These extraction patterns make it possible to extract the corresponding information from an original web page. The original forum thread identifier is the identifier that each network forum assigns to its own threads; it is never repeated within the same forum.
To perform recognition, the information described in the forum thread template table, such as the URL of the page, is first extracted from the original web page and then matched against the information stored in the table. Different forums organize their structure with different parameters and separate page content in different ways, so different pattern-matching information must be set up for different forums so that the system can obtain the relevant content according to predefined pattern parameters. One feasible implementation analyzes the URL of the original web page to see whether a matching forum thread template exists. Suppose the URL is http://bbs.test01.com/read.php?tid=48395&fpage=0&toread=&page=2. Extracting bbs.test01.com from it matches the forum identified as Forum1 in the predefined patterns, so the corresponding forum thread template is identified as the one represented by Forum1;
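The URL-based recognition described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `TEMPLATES` dictionary and the function name `identify_forum` are hypothetical, and the sketch assumes that the host name alone is enough to pick a template.

```python
from urllib.parse import urlparse

# Hypothetical in-memory stand-in for the template library:
# host name of the forum -> forum identifier of its template.
TEMPLATES = {
    "bbs.test01.com": "Forum1",
    "bbs.test02.com": "Forum2",
}

def identify_forum(url):
    """Return the forum identifier whose template matches the URL, or None."""
    host = urlparse(url).hostname
    return TEMPLATES.get(host)
```

A page whose host is not in the table yields `None`, which corresponds to the unrecognized case that the text says must be discarded or analyzed separately.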
An information extraction unit 103, configured to extract from the original web page the information identified by the forum thread template, including the forum identifier;
Once the forum thread template for the original web page has been identified, the forum threads and post data contained in the forum page are extracted according to the matched template. The information extracted is that identified in the template: only such information has corresponding entries in the database, so restricting extraction to it guarantees that everything extracted can be saved. Concretely, extraction analyzes the original forum page and constructs an information identification structure with which the data is pulled out of the page structure. This structure depends on the language in which the page is implemented: an HTML tag tree can be used for pages written in HTML, an XML tag structure for pages written in XML, and so on. One possible form of the HTML tag tree provided by the embodiment is described below:
One possible tag tree for extracting post content is as follows:
<DIV id=main>
  <FORM name=delatc action=masingle.php?action=delatc method=post>
    <DIV class="t t2">
      <TR class=trl>
        <TH class=r_one>
          <DIV class=tpc_content>......</DIV>
        </TH>
      </TR>
    </DIV>
  </FORM>
</DIV>
Here, the content inside <DIV class=tpc_content>......</DIV> is the post content;
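As a rough illustration of tag-tree extraction, the sketch below pulls the text of every `tpc_content` block out of a page using Python's standard `html.parser` module. The class name `PostContentParser` and the function `extract_posts` are hypothetical; the only detail taken from the text is that post content sits inside a DIV whose class is `tpc_content`.

```python
from html.parser import HTMLParser

class PostContentParser(HTMLParser):
    """Collects the text inside every <div class="tpc_content"> element."""

    def __init__(self):
        super().__init__()
        self.posts = []     # text of each post found so far
        self._depth = 0     # greater than 0 while inside a tpc_content div
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            self._depth += 1    # nested div inside a post
        elif "tpc_content" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.posts.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)

def extract_posts(html):
    """Return the text content of all posts found in the page."""
    parser = PostContentParser()
    parser.feed(html)
    parser.close()
    return parser.posts
```

The parser ignores the outer FORM and TR/TH levels of the tag tree; a stricter version could also check the ancestors, at the cost of being more sensitive to template changes.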
One possible tag tree for deciding whether a post is the topic post is as follows:
<DIV id=main>
  <FORM name=delatc action=masingle.php?action=delatc method=post>
    <A name=tpc></A>
    <DIV class="t t2">
      ......
    </DIV>
  </FORM>
</DIV>
If the value of name in <A name=...></A> is tpc, the post content represented by <DIV class="t t2">......</DIV> is the content of the topic post; otherwise it is a reply post;
After extraction, the information is processed: for example, replies whose content is shorter than a preset length are filtered out, as are posts that have been blocked. A post attribute object is then created for each post, producing the set of post attribute objects that covers the post content of this forum page. The data carried by a post attribute object includes, but is not limited to: the post identifier; the identifier of the forum thread to which the post belongs; the post content; the post form (whether the post is the topic post or a reply post); the topic post type (for example featured, original, reposted, comment, recommendation, announcement, knowledge, poll, activity, other); the topic post title; information about the posting user (for example user identifier and user level); the floor within the thread (which reply in the forum thread this is, the topic post being floor 0); and other additional attributes (for example whether the post is pinned or marked as featured). One possible way to obtain the original forum thread identifier is to analyze the URL of the original web page. Suppose the URL of the original web page is http://bbs.test01.com/read.php?tid=48395&fpage=0&toread=&page=2; the original forum thread identifier extracted from it is 48395;
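The URL analysis in the example above can be sketched with the standard `urllib.parse` module. The function name `original_thread_id` and its `param` argument are hypothetical, and the sketch covers only forums that carry the identifier in a query parameter such as `tid`; a forum that places it in the URL path would need a different pattern.

```python
from urllib.parse import urlparse, parse_qs

def original_thread_id(url, param="tid"):
    """Return the value of the given query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query, keep_blank_values=True)
    values = query.get(param)
    return values[0] if values else None
```

The same helper can read the page number by passing `param="page"`, which is one way the paging extraction pattern of a template could be applied.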
Which information is actually extracted can of course be configured by the system as needed. The selected information includes the forum identifier, which is unique within the forum thread database, so the forum identifier determines where the information for that forum is located in the database;
An information saving unit 104, configured to save the information in the entry of the forum thread database 15 that corresponds to the forum identifier;
After the information identified by the forum thread template has been obtained, it is saved to the entry of the forum thread database that corresponds to the forum identifier;
In practice a forum is large, and one forum identifier may correspond to several entries. To make sure the information is saved in one definite entry, the original forum thread identifier of the original web page must also be obtained, so that the entry for this page can be found directly. This works because the original forum thread identifier is assigned by each network forum to its own threads and is never repeated within the same forum. After the entries for the forum identifier have been found, they are searched for the one entry that corresponds to the original forum thread identifier. If it is found, the information in that existing entry is updated. If it is not found, a new entry for the original forum thread identifier is created in the forum thread database and the information is saved there;
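The update-or-create behaviour described above can be sketched as follows, assuming (purely for illustration) that the forum thread database is a dictionary keyed by the pair of forum identifier and original forum thread identifier. The function name `save_thread_info` is hypothetical.

```python
def save_thread_info(db, forum_id, original_thread_id, info):
    """Update the entry for (forum_id, original_thread_id), creating it if needed.

    Returns True when a new entry was created, False when an existing one was updated.
    """
    key = (forum_id, original_thread_id)
    created = key not in db
    entry = db.setdefault(key, {})   # new empty entry when the key is missing
    entry.update(info)               # merge the newly extracted information
    return created
```

A real store would more likely be a relational table with a composite unique key, but the control flow (look up, then update or insert) is the same.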
In practice, a forum thread identifier can also be assigned to each original forum thread identifier of each forum. The forum thread identifier is assigned automatically by the system and uniquely identifies, within the system, one original forum thread identifier under one forum. The corresponding information can then be looked up by the forum thread identifier alone, rather than by the pair of forum identifier and original forum thread identifier, which improves the processing efficiency of the forum thread database;
One possible layout of the forum thread database comprises a forum thread table and a post attribute table (in practice the two tables can of course be combined). One possible form of the forum thread table is shown in Table 2:
Table 2. Forum thread table
Forum thread identifier | Forum identifier | Original forum thread identifier | ......
Thread1 | Forum1 | 48395 | ......
Thread2 | Forum2 | 2766592 | ......
...... | ...... | ...... | ......
With the forum thread table of Table 2, the forum thread identifier can be found from the forum identifier and the original forum thread identifier, and conversely the forum identifier and original forum thread identifier can be found from the forum thread identifier;
One possible form of the post attribute table is shown in Table 3:
Table 3. Post attribute table
Post identifier | Forum thread identifier | Post content | Post form | Topic post type | Topic post title | Posting user identifier | Floor in thread | ......
1 | Thread1 | ×× | Topic post | Original | ×× | User01 | 0 | ......
2 | Thread1 | ×× | Reply post | None | None | User02 | 1 | ......
...... | ...... | ...... | ...... | ...... | ...... | ...... | ...... | ......
With the post attribute table of Table 3, the information of the posts in a thread can be looked up by the forum thread identifier;
On today's network forums, a popular thread attracts many replies, and these replies are very likely spread over several web pages. However many pages it spans, it corresponds to exactly one forum thread. This embodiment takes the forum thread as the object of processing, so the several web pages belonging to one forum thread are not handled separately, and search results are more accurate when the forum thread is the object of the search.
The present invention further provides embodiment two of the device for establishing a forum thread database. As shown in Fig. 2, the device 20 for establishing a forum thread database comprises:
An original web page acquiring unit 201, configured to obtain unprocessed original web pages;
A forum thread template recognition unit 202, configured to use the preset forum thread template library 14 to identify the forum thread template corresponding to the original web page;
An information extraction unit 203, configured to extract from the original web page the information identified by the forum thread template, the information including the forum identifier;
An original forum thread identifier acquiring unit 204, configured to extract from the original web page the original forum thread identifier corresponding to that page;
An entry lookup unit 205, configured to look up, in the forum thread database 15, the entry corresponding to the forum identifier and the original forum thread identifier;
An information saving unit 206, configured to save the information in the entry corresponding to the forum identifier and the original forum thread identifier;
In this embodiment, the added original forum thread identifier acquiring unit obtains the original forum thread identifier of the original web page. With this identifier, the information extracted from the page can be saved to the entry of its own forum thread. When a forum contains several threads, each thread is therefore handled separately, and a query can retrieve just the corresponding information by the forum thread identifier, which improves system processing efficiency.
In practice, the entry for a given original forum thread identifier may not yet exist. In that case an entry creation unit needs to be added to the device embodiment, configured to create in the forum thread database the entry corresponding to the original forum thread identifier. Further, if the forum thread database contains no entry corresponding to a given forum identifier, the entry corresponding to that forum identifier can likewise be created in the forum thread database.
As shown in Fig. 3, the device 31 for establishing an index database provided by an embodiment of the present invention comprises:
A forum thread acquiring unit 311, configured to obtain from the forum thread database 15 the forum thread corresponding to a forum thread identifier;
The forum thread acquiring unit sends the forum thread database a message requesting forum threads. On receiving this message, the forum thread database returns to the forum thread acquiring unit the forum threads that have not yet been indexed, or that have been indexed but updated since. The number of forum threads returned can be set as needed;
The forum thread database can be established by the device for establishing a forum thread database shown in Fig. 1;
A keyword set acquiring unit 312, configured to preprocess the forum thread and obtain the keyword set that represents the forum thread corresponding to the forum thread identifier;
Preprocessing includes, but is not limited to, word segmentation and/or filtering. Word segmentation serves to remove meaningless words. Some sensitive words are not permitted by law or other regulations, so filtering is also needed. The result is a set of keywords that represents the forum thread. These operations matter most for the post content;
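A minimal sketch of this preprocessing step is given below. It is only an approximation: the regular-expression tokenizer works for whitespace-delimited languages and stands in for a real word segmenter (Chinese text would need a dedicated segmentation library), and the `STOP_WORDS` and `SENSITIVE_WORDS` lists, like the function name `keyword_set`, are hypothetical.

```python
import re

STOP_WORDS = {"the", "a", "of", "is"}    # hypothetical list of meaningless words
SENSITIVE_WORDS = {"forbiddenword"}      # hypothetical list of disallowed words

def keyword_set(text):
    """Split text into lowercase tokens and drop stop words and sensitive words."""
    tokens = re.findall(r"\w+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS and t not in SENSITIVE_WORDS}
```

The same function can be reused on the user query word at search time, so that query terms and indexed keywords are normalized in the same way.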
An information saving unit 313, configured to save the forum thread and the keyword set to the index database 32;
By segmenting and filtering the information of the original web pages, the keywords that identify the content of a forum thread are obtained. When web page search is provided to a user, the corresponding forum thread can be found from the keywords, so the several web pages of one thread need not be handled separately, and search results are more accurate when the forum thread is the object of the search.
In practice, to make the information stored in the index database more complete and so provide more information for web page search, the device for establishing an index database can further include:
A co-occurrence frequency statistics unit, configured to count the co-occurrence frequency of the keywords in the keyword set, and/or a single-text term frequency statistics unit, configured to count the single-text term frequency of the keywords in the keyword set; the information saving unit correspondingly saves the co-occurrence frequency and/or the single-text term frequency in the index database;
The co-occurrence frequency concerns how a keyword is distributed within a forum thread: it counts the keyword's occurrences across the thread's posts. A simple way of counting it is as follows: for each post, if the keyword occurs in it, the post counts as 1 regardless of how many times the keyword occurs. If a keyword occurs in five posts of the thread, its co-occurrence frequency is therefore 5, even if it occurs three times in each of them. This is the simplest scheme. In practice, different weights can be assigned according to where and how often the keyword occurs: for example, an occurrence in the topic post can weigh more than an occurrence in a reply post, and the more often the keyword occurs in the forum thread, the higher its weight;
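The simple counting scheme from the example above (one count per post that contains the keyword) can be written as a short function. The name `co_occurrence_frequency` and the representation of a thread as a list of token lists are assumptions made for this sketch; the weighted variants mentioned in the text are not covered.

```python
def co_occurrence_frequency(keyword, posts):
    """Count the posts that contain the keyword at least once.

    posts: list of token lists, one list per post in the thread.
    """
    return sum(1 for tokens in posts if keyword in tokens)
```

With five posts that each mention the keyword three times plus one post that does not mention it, the function returns 5, matching the worked example in the text.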
Storing the co-occurrence frequency and/or single-text term frequency of keywords in the index database allows search results to be sorted by these values before being returned, so that the forum threads that best match the user query word come first. The user reaches the desired content faster, which meets the user's needs and improves user satisfaction.
One index database provided by the embodiment comprises a forum thread forward index table and a forum thread inverted index table. The forum thread forward index table is shown in Table 4:
Table 4. Forum thread forward index table
Figure C20071013634500181
As shown in Table 4, the forum thread forward index table is indexed by forum thread. For each forum thread it records the keyword set, together with information such as the single-text term frequency and co-occurrence frequency of each keyword in the set;
The forum thread inverted index table is shown in Table 5:
Table 5. Forum thread inverted index table
Figure C20071013634500182
As shown in Table 5, the forum thread inverted index table is indexed by keyword. For each keyword it records which forum threads contain it, together with information such as the keyword's single-text term frequency and co-occurrence frequency in each of those threads;
Tables 4 and 5 describe just one way of implementing the index database. In practice only one of the tables may be needed, or more tables may be built.
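The two tables can be pictured as a pair of nested dictionaries built in one pass over the threads. The sketch below is an assumption-laden illustration: the function name `build_indexes`, the field names `tf` and `co`, and the choice to compute the term frequency per thread (occurrences divided by the total number of tokens in the thread) are not taken from the patent.

```python
from collections import Counter, defaultdict

def build_indexes(threads):
    """Build forward and inverted indexes.

    threads: {thread_id: [post_tokens, ...]}, one token list per post.
    Returns (forward, inverted) where
      forward[thread_id][keyword]  -> {"tf": ..., "co": ...}
      inverted[keyword][thread_id] -> the same statistics.
    """
    forward = {}
    inverted = defaultdict(dict)
    for thread_id, posts in threads.items():
        occurrences = Counter(t for post in posts for t in post)
        post_counts = Counter(t for post in posts for t in set(post))
        total = sum(occurrences.values())
        stats = {
            kw: {"tf": occurrences[kw] / total, "co": post_counts[kw]}
            for kw in occurrences
        }
        forward[thread_id] = stats
        for kw, s in stats.items():
            inverted[kw][thread_id] = s
    return forward, dict(inverted)
```

Keeping both directions costs extra storage but lets the indexer update one thread at a time (forward table) while the search side looks up by keyword (inverted table).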
The present invention further provides method embodiment one for searching web pages which, as shown in Fig. 4, comprises:
Step 401: obtain the user query word;
When a user wants to look something up, the user can enter the corresponding query word through the interface provided by the search engine;
Step 402: search the index database for the forum threads corresponding to the user query word;
The index database can be established by the flow shown in Fig. 2;
Once the user query word has been obtained, it is used as the keyword for looking up the corresponding forum threads in the index database;
Further, in practice the query word entered by the user may not meet the requirements of a keyword, so before the lookup the query word needs word segmentation and/or filtering. Segmenting the query word removes meaningless words from it and yields words identical to the keywords, which makes the search more accurate. Some sensitive words are not permitted by law or other regulations, so the query word also needs to be filtered;
Step 403: format the forum threads found and output the formatted forum threads;
So that the user can grasp the information of each forum thread in the search results, the threads need some formatting, such as showing part of the post content and highlighting the keywords in it. The user can then learn the content without opening the corresponding web page link and so find the desired content as quickly as possible;
With the technical solution of this embodiment, the forum threads corresponding to the user's query word are returned, so the user obtains query results in units of forum threads, and the several web pages of one forum thread are not handled separately, which makes the results returned to the user more accurate.
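Steps 401 to 403 can be sketched end to end as below. Everything here is hypothetical scaffolding: `inverted_index` is assumed to map a keyword to the threads containing it, `thread_text` to map a thread identifier to displayable text, and square brackets stand in for whatever highlighting the user interface applies.

```python
import re

def search_threads(query, inverted_index, thread_text, snippet_len=60):
    """Look up a single query word and return (thread_id, highlighted snippet) pairs."""
    word = query.strip().lower()                      # step 401: normalize the query word
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    results = []
    for thread_id in sorted(inverted_index.get(word, {})):   # step 402: index lookup
        snippet = thread_text[thread_id][:snippet_len]
        # step 403: formatting, here a short snippet with the keyword bracketed
        highlighted = pattern.sub(lambda m: "[" + m.group(0) + "]", snippet)
        results.append((thread_id, highlighted))
    return results
```

The sketch handles one query word and returns threads in identifier order; ranking and multi-word queries are left out.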
The present invention also provides method embodiment two of web page search, as shown in Fig. 5, comprising:
Step 501: obtain the user query terms;
Step 502: preprocess the user query terms to obtain the query keywords;
Step 503: search the index database for the forum threads corresponding to the query keywords, and obtain the ranking information of the query keywords;
Step 504: format the forum threads found, and output the formatted forum threads sorted according to the ranking information;
In practice, the ranking information may be any one of, or any combination of, the co-occurrence frequency, the term frequency, and other information such as link quality or user click volume. If it is a single item, the threads can be sorted directly by its value or by a value derived from it; if it is a combination, a corresponding value can be computed with a preset algorithm and the threads sorted by the computed value. Sorting the forum threads makes it easier for the user to get information from the search results;
For example, if only the term frequency is obtained, the inverse document frequency corresponding to the term frequency must also be computed, and the ratio of term frequency to inverse document frequency is then used as the basis for sorting. This ratio is widely used in existing web search technology. It represents how much weight a keyword appearing in a web page carries in that page's content: the higher the value, the greater the keyword's weight in the page's content and the better the keyword represents the page. The term frequency (TF: Term Frequency) is the number of times the keyword appears in a web page divided by the total number of words in that page. For the inverse document frequency (IDF: Inverse Document Frequency), suppose a keyword w appears in Dw web pages; the larger Dw is, the smaller the weight of w, and vice versa. It is computed as log(D/Dw), where D is the total number of web pages;
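The two definitions above translate directly into code. The sketch below assumes each web page has already been segmented into a list of words; the function names and the choice to raise an error when the keyword occurs in no page are assumptions for the example.

```python
import math
from typing import List


def term_frequency(keyword: str, page_words: List[str]) -> float:
    """TF: occurrences of the keyword divided by the page's word count."""
    return page_words.count(keyword) / len(page_words)


def inverse_document_frequency(keyword: str,
                               pages: List[List[str]]) -> float:
    """IDF: log(D / Dw), D = all pages, Dw = pages containing the keyword."""
    d = len(pages)
    dw = sum(1 for words in pages if keyword in words)
    if dw == 0:
        # log(D / 0) is undefined; the caller must handle unseen keywords.
        raise ValueError("keyword does not occur in any page")
    return math.log(d / dw)
```

Note that IDF is zero when the keyword occurs in every page (Dw = D), so any later division by IDF needs a guard for that case.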
If only the co-occurrence frequency is obtained, the threads can be sorted directly by the value of the co-occurrence frequency;
If the co-occurrence frequency is obtained together with TF, TF is processed first to obtain the value TF/IDF, and the two values TF/IDF and co-occurrence frequency are then combined into a relevance value that represents how relevant the keyword is to the content of the forum thread. One feasible method is to weight the two values: suppose the weight of TF/IDF is w1 and the weight of the co-occurrence frequency is w2 (w1 + w2 = 1); the relevance value can then be calculated as w1 * TF/IDF + w2 * co-occurrence frequency;
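The weighted combination can be sketched as below. The function name `relevance` and the default weight are assumptions. The sketch takes the text's "TF/IDF" literally as a quotient; conventional TF-IDF weighting multiplies the two quantities instead, so the operator would change if that reading is intended.

```python
def relevance(tf: float, idf: float, cooccurrence: float,
              w1: float = 0.5) -> float:
    """Relevance = w1 * TF/IDF + w2 * co-occurrence, with w1 + w2 = 1."""
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie in [0, 1]")
    if idf == 0:
        # Keyword occurs in every page, so the quotient is undefined.
        raise ValueError("IDF is zero; TF/IDF cannot be computed")
    w2 = 1.0 - w1  # enforces w1 + w2 = 1
    return w1 * (tf / idf) + w2 * cooccurrence
```

For instance, with TF = 0.5, IDF = 2.0, co-occurrence frequency 0.4 and equal weights, the relevance value is 0.5 * 0.25 + 0.5 * 0.4 = 0.325.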
Each forum thread has a co-occurrence frequency for its corresponding keywords, and the co-occurrence frequency of a keyword reflects, to some extent, how relevant the forum thread is to that keyword. Sorting the forum threads by co-occurrence frequency, with the more relevant threads ranked first, therefore lets the user find the desired information faster. When several forum threads have the same degree of relevance, they can be ordered randomly or by their order in the thread database, or some other method can be used;
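The tie-breaking option "by their order in the thread database" falls out of a stable sort, as this sketch shows. The function name and the `(thread_id, relevance)` tuple layout are assumptions for the example.

```python
from typing import List, Tuple


def rank_threads(threads: List[Tuple[str, float]]) -> List[str]:
    """Sort (thread_id, relevance) pairs, most relevant first.

    The input is assumed to be in thread-database order. Python's
    sorted() is stable, also with reverse=True, so threads with equal
    relevance keep their database order.
    """
    ordered = sorted(threads, key=lambda t: t[1], reverse=True)
    return [thread_id for thread_id, _ in ordered]
```

Here `t1` and `t3` in the input `[("t1", 0.2), ("t2", 0.9), ("t3", 0.2), ("t4", 0.5)]` tie, and `t1` stays ahead of `t3` in the result.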
Likewise, if the ranking information obtained includes not only TF and the co-occurrence frequency but also information such as link quality and user click volume, a weight is set for each item of ranking information and a corresponding algorithm is used to compute the relevance value;
In the technical solution of this embodiment, the forum indexes are further sorted by their degree of relevance to the user query terms, so that the forum threads that better match the query terms are ranked higher. The user can thus find the information sought as quickly as possible, which improves user satisfaction.
To describe the implementation of the technical solution of the embodiments more clearly, the embodiments of the invention further provide method embodiment three of web page search. This embodiment describes the whole flow from obtaining an original web page to outputting the web page search results and, as shown in Fig. 6, comprises:
Step 601: obtain an unprocessed original web page;
Step 602: use a preset forum thread template library to identify the forum thread template corresponding to the original web page;
Step 603: extract from the original web page the forum thread identified by the corresponding forum thread template;
In practice, once the forum thread has been extracted, this information can be saved to the forum thread database;
Step 604: perform word segmentation and filtering on the forum thread to obtain the keyword set representing the forum thread;
Step 605: compute the TF and co-occurrence frequency of the keywords in the keyword set;
Step 606: save the forum thread, the keyword set, and the TF and co-occurrence frequency of the keywords to the index database;
Step 607: obtain the user query terms;
Step 608: perform word segmentation and filtering on the user query terms to obtain the query keywords;
Step 609: search the index database for the forum threads corresponding to the query keywords;
Step 610: format the forum threads found;
Step 611: obtain the TF and co-occurrence frequency of the query keywords from the index database;
Step 612: compute the IDF of the query keywords, calculate TF/IDF, and use TF/IDF and the co-occurrence frequency to calculate the relevance value between the query keywords and each forum thread;
The IDF is obtained by counting how many forum threads in the current index database as a whole contain the query keyword;
Step 613: output the formatted forum threads sorted by relevance value;
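Steps 604 to 606 and step 609 can be illustrated with a small in-memory inverted index. This is only a sketch: the dictionary layout and function names are assumptions, the threads are taken as already segmented and filtered, and the co-occurrence frequency is left out because its computation is defined elsewhere in the description.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def build_index(threads: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """Map each keyword to {thread_id: TF} (steps 605 and 606).

    `threads` maps a thread id to its segmented, filtered words (step 604).
    """
    index: Dict[str, Dict[str, float]] = defaultdict(dict)
    for thread_id, words in threads.items():
        for word, count in Counter(words).items():
            index[word][thread_id] = count / len(words)
    return dict(index)


def lookup(index: Dict[str, Dict[str, float]],
           query_keywords: List[str]) -> List[str]:
    """Step 609: ids of threads containing at least one query keyword."""
    found: List[str] = []
    for keyword in query_keywords:
        for thread_id in index.get(keyword, {}):
            if thread_id not in found:
                found.append(thread_id)
    return found
```

The number of entries stored under a keyword is the count of threads containing it, which is the quantity step 612 needs for the IDF.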
With this embodiment, after an original web page is obtained, the forum thread corresponding to it can be determined, the corresponding information extracted, the keyword set representing the forum thread obtained, and the TF and co-occurrence frequency of the keywords in the set computed. When a user query keyword matches a keyword in this set, the forum thread can be judged to meet the user's needs. The index database will of course contain many forum threads that meet the user's needs, so the relevance value between each forum thread and the user query keywords is obtained from TF/IDF and the co-occurrence frequency, and the forum threads are then sorted by relevance value for output. The user thus obtains the forum threads related to the query keywords, sorted by relevance value with higher values ranked first, and can find the information sought as quickly as possible, which improves user satisfaction.
The embodiments of the invention provide a web page search device 70, as shown in Fig. 7, comprising:
A user query term acquiring unit 701, used to obtain the user query terms;
A forum thread searching unit 702, used to search index database 32 for the forum threads corresponding to the user query terms;
A forum thread output unit 703, used to format the forum threads found and output the formatted forum threads to the user;
With the technical solution of this embodiment, the forum index corresponding to the user's query terms can be returned to the user, so that the user obtains query results in units of forum indexes, and the multiple web pages of a forum index can be processed separately, which makes the query results returned to the user more accurate.
Further, in practice the query terms entered by the user may not meet the requirements for keywords, so the device embodiment for web page search may further comprise:
A query keyword acquiring unit, which performs word segmentation and filtering on the user query terms to obtain the query keywords;
The forum thread searching unit can then search the index database for the forum threads corresponding to the query keywords; because the query keywords are derived from the user query terms, the forum threads found also correspond to the user query terms;
Further, so that the user can find the desired information as quickly as possible, the forum threads that are output can be sorted, and the device embodiment for web page search may therefore also comprise:
A ranking information acquiring unit, used to obtain the ranking information of the query keywords in the forum threads;
The ranking information may be TF and/or the co-occurrence frequency, among others. Once information such as TF and the co-occurrence frequency has been obtained, the forum thread output unit sorts the forum threads by the calculated TF/IDF value, by the co-occurrence frequency value, or by the calculated relevance value, and outputs the forum threads to the user in the sorted order. The forum threads that better match the user query terms are thus ranked higher, so the user can find the information sought as quickly as possible, which improves user satisfaction.
The web page search system of the embodiments of the invention, as shown in Fig. 8, comprises:
A device 801 for building the forum thread database, used to obtain an unprocessed original web page; use a preset forum thread template library to identify the forum thread template corresponding to the original web page; extract from the original web page the information identified by the forum thread template, the information including a forum identifier; and save the information in the entry of the forum thread database that corresponds to the forum identifier;
A device 802 for building the index database, used to obtain from the forum thread database the forum thread corresponding to a forum thread identifier; perform word segmentation and filtering on the forum thread to obtain the keyword set representing the forum thread; and save the forum thread and the keyword set to the index database;
A web page search device 803, used to obtain the user query terms; search the index database for the forum threads corresponding to the user query terms; format the forum threads found; and output the formatted forum threads.
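The save step performed by device 801 can be sketched as a look-up-or-create operation on the thread database. Keying the entry by both the forum identifier and the original forum thread identifier follows claims 11 and 12; the in-memory dictionary, the function name, and the field names in the usage note are hypothetical.

```python
from typing import Dict, Tuple

# (forum identifier, original forum thread identifier)
ThreadKey = Tuple[str, str]


def save_thread_info(db: Dict[ThreadKey, dict], forum_id: str,
                     thread_id: str, info: dict) -> bool:
    """Save extracted information; return True if a new entry was created."""
    key = (forum_id, thread_id)
    created = key not in db
    if created:
        # No entry for this forum and thread yet, so create one first.
        db[key] = {}
    # Merge the newly extracted information into the entry.
    db[key].update(info)
    return created
```

Calling it twice for the same thread, for example with `{"title": "x"}` and then `{"posts": 2}`, leaves a single entry that holds both fields, which is how several web pages of one thread end up merged into one record.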
With the technical solution of this embodiment, the forum index corresponding to the user's query terms can be returned to the user, so that the user obtains query results in units of forum indexes, and the multiple web pages of a forum index can be processed separately, which makes the query results returned to the user more accurate.
It will be understood that the web page search method, device and system provided by the embodiments of the invention can be applied in a web search engine, which may be a dedicated forum search engine or a general-purpose search engine. When searching forum web pages, the search engine then processes them in units of forum threads, which improves the accuracy of the information it returns and improves user satisfaction.
A person of ordinary skill in the art will appreciate that all or some of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, comprises the following steps:
Obtaining the user query terms;
Searching the index database for the forum threads corresponding to the user query terms;
Formatting the forum threads found, and outputting the formatted forum threads;
The storage medium mentioned above may be a read-only memory (ROM), a magnetic disk, an optical disc, or the like.
The web page search method, device and system and the device for building the index database provided by the embodiments of the invention have been described in detail above. The description of the embodiments is intended only to help in understanding the method of the invention and its ideas. A person of ordinary skill in the art may, following the ideas of the invention, make changes to the specific embodiments and the scope of application. In summary, this description should not be construed as limiting the invention.

Claims (25)

1. A method for searching web pages, characterized by comprising:
obtaining user query terms;
searching a preset index database for the forum threads corresponding to the user query terms;
formatting the forum threads found, and outputting the formatted forum threads.
2. The method for searching web pages according to claim 1, characterized in that, after the user query terms are obtained, the method further comprises:
preprocessing the user query terms to obtain query keywords;
the searching of the preset index database for the forum threads corresponding to the user query terms is specifically: searching the index database, according to the query keywords, for the forum threads corresponding to the user query terms.
3. The method for searching web pages according to claim 2, characterized in that, before the formatted forum threads are output, the method further comprises:
obtaining ranking information of the query keywords in the forum threads;
outputting the formatted forum threads sorted according to the ranking information.
4. The method for searching web pages according to claim 3, characterized in that, if the ranking information is the term frequency, the outputting of the formatted forum threads sorted according to the ranking information is specifically:
computing the inverse document frequency corresponding to the term frequency;
outputting the formatted forum threads sorted according to the ratio of the term frequency to the inverse document frequency.
5. The method for searching web pages according to claim 3, characterized in that, if the ranking information is the co-occurrence frequency, the formatted forum threads are output sorted according to the co-occurrence frequency.
6. The method for searching web pages according to claim 3, characterized in that, if the ranking information is the term frequency and the co-occurrence frequency, the outputting of the formatted forum threads sorted according to the ranking information is specifically:
calculating, from the term frequency and the co-occurrence frequency by a preset algorithm, the relevance value between the query keywords and the forum threads;
outputting the formatted forum threads sorted according to the relevance value.
7, the method for search and webpage as claimed in claim 1 is characterized in that, described index data base is set up by following flow process:
From forum's clue database, obtain the corresponding forum's clue of forum's clue sign;
Described forum clue is carried out pre-service, obtain the set of keywords of the described forum of expression clue;
Described forum clue and described set of keywords correspondence are saved to index data base.
8, the method for search and webpage as claimed in claim 7 is characterized in that, further adds up the co-occurrence frequency of key word in the described set of keywords;
Further in described index data base, preserve described co-occurrence frequency.
9, as the method for claim 7 or 8 described search and webpages, it is characterized in that, further add up single text vocabulary frequency of key word in the described set of keywords;
Further in described index data base, preserve described single text vocabulary frequency.
As the method for claim 7 or 8 described search and webpages, it is characterized in that 10, described forum clue database adopts following flow process to set up:
Obtain untreated original web page;
Forum's clue template base that use is preset identifies forum's clue template of described original web page correspondence;
Extract the information that described forum clue template is identified from described original web page, described information comprises forum's sign;
Identify the described information of preservation in the corresponding list item at forum's clue database with described forum.
11, the method for search and webpage as claimed in claim 10 is characterized in that, further extracts the original forum clue sign of described original web page correspondence from described original web page;
The described information of preservation takes a step forward and comprises in forum's clue database list item corresponding with described forum sign:
From forum's clue database lookup and described forum sign and the corresponding list item of described original forum clue sign, in the list item corresponding, preserve described information with described forum sign and described original forum clue sign.
12, the method for search and webpage as claimed in claim 11 is characterized in that, the described information of preservation takes a step forward and comprises in the list item corresponding with described forum sign and described original forum clue sign:
Judge whether the list item corresponding with described original forum clue sign exists, if enter and identifying the step of the described information of preservation in the corresponding list item with described forum sign and described original forum clue; If not, newly-built and described forum sign and the corresponding list item of described original forum clue sign enter the step of preserving described information in the list item corresponding with described forum sign and described original forum clue sign in described forum clue database.
13, as the method for claim 7 or 8 described search and webpages, it is characterized in that, described forum clue is carried out pre-service, the set of keywords that obtains the described forum of expression clue is specially:
Described forum clue is carried out word split and/or filter, obtain the set of keywords of the described forum of expression clue.
14. A device for building a forum thread database, characterized by comprising:
an original web page acquiring unit, used to obtain an unprocessed original web page;
a forum thread template recognition unit, used to identify, with a preset forum thread template library, the forum thread template corresponding to the original web page;
an information extraction unit, used to extract from the original web page the information identified by the forum thread template, the information including a forum identifier;
an information saving unit, used to save the information in the entry of the forum thread database that corresponds to the forum identifier.
15. The device for building a forum thread database according to claim 14, characterized by further comprising:
an original forum thread identifier acquiring unit, used to extract from the original web page the original forum thread identifier corresponding to the original web page;
an entry lookup unit, used to look up, in the forum thread database, the entry corresponding to the forum identifier and the original forum thread identifier;
the information saving unit being used to save the information in the entry corresponding to the forum identifier and the original forum thread identifier.
16. The device for building a forum thread database according to claim 15, characterized in that, if the entry lookup unit does not find the entry corresponding to the forum identifier and the original forum thread identifier, the device further comprises:
an entry creation unit, used to create in the forum thread database a new entry corresponding to the forum identifier and the original forum thread identifier.
17. A device for building an index database, characterized by comprising:
a forum thread acquiring unit, used to obtain from a forum thread database the forum thread corresponding to a forum thread identifier;
a keyword set acquiring unit, used to preprocess the forum thread to obtain a keyword set representing the forum thread;
an information saving unit, used to save the forum thread and the keyword set in correspondence to the index database.
18. The device for building an index database according to claim 17, characterized by also comprising:
a co-occurrence frequency statistics unit, used to compute the co-occurrence frequency of the keywords in the keyword set;
the information saving unit also being used to save the co-occurrence frequency to the index database.
19. The device for building an index database according to claim 17 or 18, characterized by also comprising:
a term frequency statistics unit, used to compute the term frequency of the keywords in the keyword set;
the information saving unit also being used to save the term frequency to the index database.
20. A device for searching web pages, characterized by comprising:
a user query term acquiring unit, used to obtain user query terms;
a forum thread searching unit, used to search an index database for the forum threads corresponding to the user query terms;
a forum thread output unit, used to format the forum threads found and output the formatted forum threads to the user.
21. The device for searching web pages according to claim 20, characterized by further comprising:
a query keyword acquiring unit, used to preprocess the user query terms to obtain query keywords;
the forum thread searching unit being used to search the index database, according to the query keywords, for the forum threads corresponding to the user query terms.
22. The device for searching web pages according to claim 21, characterized by further comprising:
a ranking information acquiring unit, used to obtain the term frequency of the query keywords in the forum threads;
a computing unit, used to calculate, with the inverse document frequency that is computed for the term frequency, the ratio of the term frequency to the inverse document frequency;
the forum thread output unit being used to output the formatted forum threads sorted according to the ratio of the term frequency to the inverse document frequency.
23. The device for searching web pages according to claim 21, characterized by further comprising:
a ranking information acquiring unit, used to obtain the co-occurrence frequency of the query keywords in the forum threads;
the forum thread output unit being used to output the formatted forum threads sorted according to the co-occurrence frequency.
24. The device for searching web pages according to claim 21, characterized by further comprising:
a ranking information acquiring unit, used to obtain the term frequency and the co-occurrence frequency of the query keywords in the forum threads;
a relevance value computing unit, used to calculate, from the term frequency and the co-occurrence frequency with a preset algorithm, the relevance value between the query keywords and the forum threads;
the forum thread output unit being used to output the formatted forum threads sorted according to the relevance value.
25. A system for searching web pages, characterized by comprising:
a device for building a forum thread database, used to obtain an unprocessed original web page; use a preset forum thread template library to identify the forum thread template corresponding to the original web page; extract from the original web page the information identified by the forum thread template, the information including a forum identifier; and save the information in the entry of the forum thread database that corresponds to the forum identifier;
a device for building an index database, used to obtain from the forum thread database the forum thread corresponding to a forum thread identifier; preprocess the forum thread to obtain a keyword set representing the forum thread; and save the forum thread and the keyword set in correspondence to the index database;
a device for searching web pages, used to obtain user query terms; search the index database for the forum threads corresponding to the user query terms; format the forum threads found; and output the formatted forum threads.
CNB200710136345XA 2007-07-24 2007-07-24 Method, device and system for searching web page and device for establishing index database Active CN100478962C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB200710136345XA CN100478962C (en) 2007-07-24 2007-07-24 Method, device and system for searching web page and device for establishing index database

Publications (2)

Publication Number Publication Date
CN101101605A CN101101605A (en) 2008-01-09
CN100478962C true CN100478962C (en) 2009-04-15

Family

ID=39035877

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB200710136345XA Active CN100478962C (en) 2007-07-24 2007-07-24 Method, device and system for searching web page and device for establishing index database

Country Status (1)

Country Link
CN (1) CN100478962C (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220318854A1 (en) * 2019-08-30 2022-10-06 Datascientist Inc. Content arrangement program, content arrangement device, and content arrangement method, website construction support program, website construction support device, and website construction support method, and economic scale output program, economic scale output device, and economic scale output method

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101639831B (en) * 2008-07-29 2012-09-05 华为技术有限公司 Search method, search device and search system
WO2010014954A2 (en) * 2008-08-01 2010-02-04 Google Inc. Providing posts to discussion threads in response to a search query
CN102737042B (en) * 2011-04-08 2015-03-25 北京百度网讯科技有限公司 Method and device for establishing question generation model, and question generation method and device
CN102317943B (en) * 2011-07-29 2013-10-02 华为技术有限公司 Method and device for full-text search
CN102831186A (en) * 2012-08-02 2012-12-19 深圳市同洲电子股份有限公司 Method and device for storing and searching webpage
CN103581280B (en) * 2012-08-30 2017-02-15 网易传媒科技(北京)有限公司 Method and device for interface interaction based on micro blog platform
WO2014132265A2 (en) * 2013-02-14 2014-09-04 Gyan Prakash Kesarwani An improved system and method of scanning a search engine depending on the importance of the keywords and producing an effective output
CN104951449B (en) * 2014-03-26 2020-12-01 腾讯科技(深圳)有限公司 Data processing method and device
CN105912545A (en) * 2015-12-15 2016-08-31 乐视网信息技术(北京)股份有限公司 Device, method, and system for media resource retrieval
CN109977699B (en) * 2019-03-26 2022-04-01 贝富(广州)新技术有限公司 House property information storage method, system and storage medium based on block chain
CN112052476A (en) * 2020-08-27 2020-12-08 安徽国戎科技有限公司 Military case data management method

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220318854A1 (en) * 2019-08-30 2022-10-06 Datascientist Inc. Content arrangement program, content arrangement device, and content arrangement method, website construction support program, website construction support device, and website construction support method, and economic scale output program, economic scale output device, and economic scale output method
US11756082B2 (en) * 2019-08-30 2023-09-12 Datascientist Inc. Content arrangement program, content arrangement device, and content arrangement method, website construction support program, website construction support device, and website construction support method, and economic scale output program, economic scale output device, and economic scale output method

Similar Documents

Publication Publication Date Title
CN100478962C (en) Method, device and system for searching web page and device for establishing index database
CN101520784B (en) Information issuing system and information issuing method
CN102402604B (en) Effective forward ordering of search engine
JP5721818B2 (en) Use of model information group in search
CN103914478B (en) Webpage training method and system, webpage Forecasting Methodology and system
CN1936893B (en) Method and system for generating input-method word frequency base based on internet information
CN105095368B (en) Method and device for sequencing news information
CN103294681B (en) Method and device for generating search result
CN101169780A (en) Semantic ontology retrieval system and method
CN103136228A (en) Image search method and image search device
CN110737821B (en) Similar event query method, device, storage medium and terminal equipment
CN107885793A (en) A kind of hot microblog topic analyzing and predicting method and system
CN102073725A (en) Method for searching structured data and search engine system for implementing same
CN103365839A (en) Recommendation search method and device for search engines
CN102567494B (en) Website classification method and device
CN108416034B (en) Information acquisition system based on financial heterogeneous big data and control method thereof
CN111191111A (en) Content recommendation method, device and storage medium
CN104391978A (en) Method and device for storing and processing web pages of browsers
CN106844482B (en) Search engine-based retrieval information matching method and device
CN1629837A (en) Method and apparatus for processing, browsing and classified searching of electronic document and system thereof
CN112149422B (en) Dynamic enterprise news monitoring method based on natural language
CN112328936A (en) Website identification method, device and equipment and computer readable storage medium
KR100876214B1 (en) Apparatus and method for context aware advertising and computer readable medium processing the method
CN112818200A (en) Data crawling and event analyzing method and system based on static website
CN112269906B (en) Automatic extraction method and device of webpage text

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: BEIJING JINGDONG SHANGKE INFORMATION TECHNOLOGY CO

Free format text: FORMER OWNER: HUAWEI TECHNOLOGY CO., LTD.

Effective date: 20150619

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20150619

Address after: 100101, No. 8 West Beichen Road, Beijing, Beichen Century Center, block A, 10, Chaoyang District

Patentee after: Beijing Jingdong Shangke Information Technology Co., Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: Huawei Technologies Co., Ltd.