CN100478962C - Method, device and system for searching web page and device for establishing index database - Google Patents

Publication number
CN100478962C
Authority
CN
China
Prior art keywords
forum
clues
corresponding
database
information
Prior art date
Application number
CN 200710136345
Other languages
Chinese (zh)
Other versions
CN101101605A (en)
Inventor
李自军
伟 王
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Priority to CN 200710136345
Publication of CN101101605A
Application granted
Publication of CN100478962C

Abstract

The present invention discloses a method, device and system for searching web pages, and a device for establishing an index database. With the invention, forum web pages can be analyzed and indexed with the forum thread as the unit. The method comprises: obtaining a user query word; looking up, in a preset index database, the forum thread corresponding to the user query word; and formatting the forum thread found and outputting the formatted forum thread. The invention also provides a corresponding device and system for searching web pages, and a device for establishing an index database. With the invention, the forum index corresponding to a user's query word can be returned to the user, so that the user obtains query results in units of forum indexes rather than the conventional query results in units of forum web pages, which makes the query results returned to the user more accurate.

Description

Method, device and system for searching web pages, and device for establishing an index database

Technical field

The present invention relates to the field of network technologies, and in particular to a method, device and system for searching web pages and a device for establishing an index database.

Background

With the rapid development of information retrieval technology, text information retrieval has entered a relatively mature stage. From the most primitive keyword matching to today's context-based analysis, pattern matching, instance matching, and analysis using statistical strategies, a fairly complete set of ideas and well-developed algorithms has taken shape and is widely applied in all kinds of search engines.

The existing method of providing web page search to users works as follows. First, a web page collector fetches web pages from the Internet using a crawling program such as a web spider and sends the pages to a raw web page database. The web page collector extracts uniform resource locators (URL: Uniform Resource Locator) from the pages and hands them to a collection controller for judgment. The collection controller obtains the URLs of the pages and directs the web spider to crawl further pages, and the cycle repeats until all web pages have been crawled.

The system obtains text information from the raw web page database, preprocesses each individual web page, and sends it to a "text indexer" module to build an index, forming an index database. At the same time, link information is extracted and sent to a link analysis module to establish page ratings, forming a link rating library. The link information includes the anchor text, the link itself, and similar information.

The user submits a query request to a query server, and the query server looks up relevant web pages in the index database. Meanwhile, the link rating library combines the query request with the link information to evaluate the relevance of the search results. The query server sorts the results by relevance and extracts a content summary around the keywords, and finally the user interface formats the query display content and returns it to the user.

As can be seen from the above, the prior art analyzes and indexes with the content of a single web page as the unit. This gives good search results for pages whose subject information is clear and concentrated, such as news pages. A forum web page of a discussion-group nature, however, contains numerous user discussion messages on a single page, and each message is relatively short. Because each page contains one or more posts, the corresponding forum thread (Thread) is also spread over one or more web pages, so it is difficult to obtain good search results with the existing approach of analyzing and indexing with single web page content as the unit.

Summary of the invention

An object of the embodiments of the present invention is to provide a method, device and system for searching web pages and a device for establishing an index database. With the technical solutions provided by the embodiments, forum web pages can be analyzed and indexed with the forum thread as the unit.

The object of the embodiments of the present invention is achieved through the following technical solutions:

A method for searching web pages, comprising:

obtaining a user query word;

looking up, in a preset index database, the forum thread corresponding to the user query word;

formatting the forum thread found, and outputting the formatted forum thread.
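The three steps of the method above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory dictionaries `INDEX` and `THREADS`, the thread identifiers, the sample URLs, and ranking by a stored frequency are all hypothetical stand-ins, not the structures the patent prescribes.

```python
from collections import defaultdict

# Hypothetical stand-in for the preset index database:
# keyword -> {thread_id: stored frequency}
INDEX = {
    "router": {"T1": 5, "T2": 1},
    "firmware": {"T2": 3},
}
# Hypothetical thread records: thread_id -> (title, url)
THREADS = {
    "T1": ("Router keeps rebooting", "http://bbs.test01.com/read.php?tid=1"),
    "T2": ("Firmware upgrade notes", "http://bbs.test01.com/read.php?tid=2"),
}


def search(query, index=INDEX, threads=THREADS):
    """Return formatted forum threads that match any word of the query."""
    scores = defaultdict(int)
    # Step 1: obtain the user query words.
    for word in query.lower().split():
        # Step 2: look up the forum threads that correspond to each word.
        for thread_id, freq in index.get(word, {}).items():
            scores[thread_id] += freq
    # Assumed ranking: higher accumulated frequency first, ties by thread id.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    # Step 3: format each thread for output.
    return [f"{threads[t][0]} <{threads[t][1]}>" for t in ranked]
```

Each result corresponds to one whole thread, however many web pages that thread spans, which is the point of indexing by thread rather than by page.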

A device for establishing a forum thread database, comprising:

a raw web page acquisition unit, configured to acquire an unprocessed raw web page;

a forum thread template recognition unit, configured to identify, using a preset forum thread template library, the forum thread template corresponding to the raw web page;

an information extraction unit, configured to extract from the raw web page the information identified by the forum thread template, the information including a forum identifier;

an information storage unit, configured to save the information in the entry of the forum thread database that corresponds to the forum identifier.

A device for establishing an index database, comprising:

a forum thread acquisition unit, configured to acquire from a forum thread database the forum thread corresponding to a forum thread identifier;

a keyword set acquisition unit, configured to preprocess the forum thread to obtain a keyword set representing the forum thread;

an information storage unit, configured to save the forum thread and the keyword set, in correspondence, to an index database.

A device for searching web pages, comprising:

a user query word acquisition unit, configured to acquire a user query word;

a forum thread lookup unit, configured to look up in the index database the forum thread corresponding to the user query word;

a forum thread output unit, configured to format the forum thread found and output the formatted forum thread to the user.

A system for searching web pages, comprising:

a device for establishing a forum thread database, configured to: acquire an unprocessed raw web page; identify, using a preset forum thread template library, the forum thread template corresponding to the raw web page; extract from the raw web page the information identified by the forum thread template, the information including a forum identifier; and save the information in the entry of the forum thread database corresponding to the forum identifier;

a device for establishing an index database, configured to: acquire from the forum thread database the forum thread corresponding to a forum thread identifier; preprocess the forum thread to obtain a keyword set representing the forum thread; and save the forum thread and the keyword set, in correspondence, to an index database;

a device for searching web pages, configured to: obtain a user query word; look up in the index database the forum thread corresponding to the user query word; format the forum thread found; and output the formatted forum thread.

As can be seen from the above technical solutions, the embodiments of the invention return to the user the forum index corresponding to the user's query word, so the user obtains query results in units of forum indexes rather than the conventional query results in units of forum web pages, which makes the query results returned to the user more accurate.

Brief description of the drawings

Figure 1 is a structural diagram of embodiment 1 of the device for establishing a forum thread database in the embodiments of the invention;

Figure 2 is a structural diagram of embodiment 2 of the device for establishing a forum thread database in the embodiments of the invention;

Figure 3 is a structural diagram of the device for establishing an index database in the embodiments of the invention;

Figure 4 is a flowchart of embodiment 1 of the method for searching web pages in the embodiments of the invention;

Figure 5 is a flowchart of embodiment 2 of the method for searching web pages in the embodiments of the invention;

Figure 6 is a flowchart of embodiment 3 of the method for searching web pages in the embodiments of the invention;

Figure 7 is a structural diagram of an embodiment of the device for searching web pages in the embodiments of the invention;

Figure 8 is a structural diagram of an embodiment of the system for searching web pages in the embodiments of the invention.

Detailed description

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments.

The device 10 for establishing a forum thread database provided by the embodiments of the invention is shown in Figure 1 and comprises:

A raw web page acquisition unit 101, configured to obtain an unprocessed raw web page.

A raw web page is a page fetched from the network that has not yet been processed. Raw web pages are acquired in the same way as in the prior art, as follows: the web page collector 11 traverses the web space using a crawling program such as a web spider and saves the fetched pages in the raw web page database 13; the fetching process of the web page collector is controlled by the collection controller 12.

Therefore, when a raw web page is needed, it can be obtained directly from the raw web page database.

A forum thread template recognition unit 102, configured to identify, using the preset forum thread template library 14, the forum thread template corresponding to the raw web page.

This embodiment describes only the case in which the forum thread template corresponding to the raw web page can be identified. In practice, identification may fail. If it does, the raw web page needs to be handled accordingly: for example, it can be discarded directly, or it can be analyzed to obtain its corresponding forum thread template, and the resulting template is saved to the forum thread template library 14. Because every raw web page has its own structural characteristics, each has a uniquely corresponding forum thread template.

The forum thread template library stores predefined forum thread templates. One possible entry form of a forum thread template is shown in Table 1:

(Table 1: see page 12 of the original document.)

As shown in Table 1, the forum thread template table stores a forum identifier, a URL, an extraction marker for the original forum thread identifier, an extraction marker for forum thread pagination, an extraction marker for post content, and other information. With these extraction markers, the corresponding information can be extracted from the raw web page. The original forum thread identifier is the identifier that each network forum assigns to its own forum threads, and it is never repeated within the same forum.

During identification, the information described in the forum thread template table is first extracted from the raw web page, for example the URL of the raw web page, and the extracted information is then matched against the information already stored in the forum thread template table. Different forums use different parameters to represent their structural organization and different formats to delimit page content, so different pattern matching information has to be established for different forums, so that the system can obtain the relevant content according to predefined pattern parameters. One feasible implementation is to analyze the URL of the raw web page to determine whether there is a matching forum thread template. Suppose the URL is http://bbs.test01.com/read.php?tid=48395&fpage=0&toread=&page=2. By extracting bbs.test01.com from it and matching it to the forum whose forum identifier in the predefined patterns is Forum1, the corresponding forum thread template is identified as the one represented by Forum1.
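The URL-based template identification described above can be sketched as follows. This is a minimal illustration, assuming the template library is reduced to a mapping from host name to forum identifier; the name `TEMPLATE_LIBRARY` and the function `identify_template` are hypothetical, and a real template would also carry the extraction markers.

```python
from urllib.parse import urlsplit

# Hypothetical template library: host name -> forum identifier.
TEMPLATE_LIBRARY = {"bbs.test01.com": "Forum1"}


def identify_template(url, library=TEMPLATE_LIBRARY):
    """Return the forum identifier whose template matches the URL host, or None."""
    host = urlsplit(url).hostname
    return library.get(host)
```

A `None` result corresponds to the case in which no template is identified, where the page is either discarded or analyzed to derive a new template.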

An information extraction unit 103, configured to extract from the raw web page the information identified by the forum thread template, including the forum identifier. After the forum thread template corresponding to the raw web page has been identified, the data about the forum threads and posts contained in the forum page is extracted according to the matched template. The extracted information is that identified in the forum thread template: only information identified in the template has a corresponding entry in the database, so extracting only that information guarantees that the extracted information can be stored in the database. The extraction itself is based on an analysis of the raw forum web page: an information marker structure is constructed so that the corresponding data can be extracted according to the different structures. The information marker structure depends on the language in which the page is implemented; for example, an html tag tree structure can be used for pages implemented in html, and an xml tag structure for pages implemented in xml. A possible form of the html tag tree structure provided by the embodiments of the invention is as follows.

One possible tag tree for extracting post content:

    <DIV id=main>
      <FORM name=delatc action=masingle.php?action=delatc method=post>
        <DIV class="t t2">
          <TR class=tr1>
            <TH class=r_one>
              <DIV class=tpc_content>......</DIV>
            </TH>
          </TR>
        </DIV>
      </FORM>
    </DIV>

Here, the content inside <DIV class=tpc_content>......</DIV> is the post content.

One tag tree for judging whether a post is a topic post:

    <DIV id=main>
      <FORM name=delatc action=masingle.php?action=delatc method=post>
        <A name=tpc></A>
        <DIV class="t t2">
        </DIV>
      </FORM>
    </DIV>

If the value of name in <A name=??></A> is tpc, the post content represented by <DIV class="t t2">......</DIV> is the content of a topic post; otherwise it is a reply post.
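The two tag-tree rules above can be exercised with Python's standard `html.parser`. The sketch below is an assumption-laden illustration, not the patent's extractor: it hard-codes the class name `tpc_content` and the anchor name `tpc` from the example trees, and it assumes the anchor for a post precedes that post's content block.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect (is_topic_post, text) for every tpc_content block."""

    def __init__(self):
        super().__init__()
        self.posts = []        # list of (is_topic_post, content)
        self._anchor = None    # name of the most recent <a name=...>
        self._depth = 0        # > 0 while inside a tpc_content <div>
        self._text = []
        self._is_topic = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "name" in attrs and self._depth == 0:
            self._anchor = attrs["name"]
        elif tag == "div":
            if self._depth > 0:
                self._depth += 1   # nested div inside the post content
            elif attrs.get("class") == "tpc_content":
                self._depth = 1
                self._text = []
                self._is_topic = self._anchor == "tpc"

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.posts.append((self._is_topic, "".join(self._text).strip()))
                self._anchor = None  # an anchor applies to one post only

    def handle_data(self, data):
        if self._depth > 0:
            self._text.append(data)


def extract_posts(html):
    """Return a list of (is_topic_post, content) tuples found in the page."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser.posts
```

In a template-driven system the class and anchor names would come from the matched forum thread template rather than being fixed in code.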

After the information has been extracted, it is processed: for example, a reply post whose content is shorter than a preset value is filtered out, and blocked posts are filtered out. A post attribute object is then created for each post, producing a set of post attribute objects containing the post content of the forum page. The data contained in a post attribute object includes, but is not limited to: the post identifier; the identifier of the forum thread the post belongs to; the post content; the post form (whether the post is a topic post or a reply post); the topic post type (for example featured, original, reposted, comment, recommendation, announcement, knowledge, poll, other, event); the topic post title; poster information (for example user ID and user level); the floor within the topic (the position of a reply post within the forum thread; the topic post is floor 0); and other extra attributes (for example whether the post is pinned or marked as featured). One possible way to obtain the original forum thread identifier is to analyze the URL of the raw web page: suppose the URL of the raw web page is http://bbs.test01.com/read.php?tid=48395&fpage=0&toread=&page=2; the original forum thread identifier extracted from it is 48395.
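The URL analysis and the reply-length filter described above might look like the following sketch. The `Post` fields are a hypothetical subset of the post attributes listed in the text, and the parameter name `tid` and the threshold `min_reply_length` are assumptions taken from the example URL and chosen for illustration.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlsplit


def original_thread_id(url, param="tid"):
    """Return the value of the thread parameter in the URL query, or None."""
    values = parse_qs(urlsplit(url).query).get(param)
    return values[0] if values else None


@dataclass
class Post:
    """Hypothetical subset of the post attribute object."""
    post_id: str
    thread_id: str
    content: str
    is_topic: bool
    floor: int


def filter_posts(posts, min_reply_length=5):
    """Drop reply posts shorter than a preset length; topic posts are kept."""
    return [p for p in posts if p.is_topic or len(p.content) >= min_reply_length]
```

Which query parameter carries the thread identifier differs between forum software, so in practice it would be one of the extraction markers stored in the template.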

Of course, exactly which information is acquired can be set by the system as needed. The selected information includes the forum identifier, which is unique in the forum thread database; the forum identifier determines where the corresponding information is located in the forum thread database.

An information storage unit 104, configured to save the information in the entry of the forum thread database 15 that corresponds to the forum identifier.

After the information identified by the forum thread template has been acquired, it is saved to the entry of the forum thread database corresponding to the forum identifier.

In practice, because forums are fairly large, one forum identifier corresponds to multiple entries. To make sure the information is saved to one specific entry, the original forum thread identifier corresponding to the raw web page must also be acquired, so that the entry record corresponding to the raw web page can be found directly. This works because the original forum thread identifier is the identifier each network forum assigns to its own forum threads and is never repeated within the same forum. After the entries corresponding to the forum identifier have been found, the entry corresponding to the original forum thread identifier is looked up among them. If it is found, the stored information is updated in that existing entry; if not, a new entry corresponding to the original forum thread identifier is created in the forum thread database and the information is saved there. In practice, a forum thread identifier can also be assigned to each original forum thread identifier. The forum thread identifier is assigned automatically by the system and uniquely identifies, within the system, one original forum thread identifier under one forum identifier. The corresponding information can then be looked up by the forum thread identifier alone, rather than by the pair of forum identifier and original forum thread identifier, which improves the processing efficiency of the forum thread database.
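The update-or-create logic and the system-assigned forum thread identifier can be illustrated with a toy in-memory store. The class `ThreadDatabase` and its methods are hypothetical; a real implementation would sit on a persistent database.

```python
class ThreadDatabase:
    """Toy forum thread database keyed by (forum_id, original_thread_id)."""

    def __init__(self):
        self._ids = {}       # (forum_id, original_thread_id) -> thread_id
        self._entries = {}   # thread_id -> stored information
        self._next_id = 1

    def save(self, forum_id, original_thread_id, info):
        """Update the matching entry, creating it first if it does not exist."""
        key = (forum_id, original_thread_id)
        thread_id = self._ids.get(key)
        if thread_id is None:
            # No entry yet: assign a new system-wide thread identifier.
            thread_id = self._next_id
            self._next_id += 1
            self._ids[key] = thread_id
            self._entries[thread_id] = {}
        self._entries[thread_id].update(info)
        return thread_id

    def get(self, thread_id):
        """Look up stored information by the system thread identifier alone."""
        return self._entries.get(thread_id)
```

Because two forums may reuse the same original thread identifier, the pair is needed when writing, while the single assigned identifier is enough when reading.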

In the forum thread database, one possibility is to have a forum thread table and a post attribute table (in practice the two tables can also be merged into one). One possible form of the forum thread table is shown in Table 2:

Table 2. Forum thread table

(Table 2: see page 15 of the original document.)

With the forum thread table described in Table 2, the forum thread identifier can be found from the forum identifier and the original forum thread identifier, and the forum identifier and original forum thread identifier can be found from the forum thread identifier.

One possible form of the post attribute table is shown in Table 3:

Table 3. Post attribute table

(Table 3: see page 15 of the original document.)

With the post attribute table described in Table 3, information about the corresponding posts can be found from the forum thread identifier.

On today's network forums, a popular post may attract many replies, and those replies are likely to be spread over several web pages of the thread. However many web pages a post spans, they all correspond to a single forum thread. This embodiment takes the forum thread as the object of processing and does not handle the web pages belonging to one forum thread separately, which makes search results more accurate when forum threads are the search target.

The invention further provides embodiment 2 of the device for establishing a forum thread database. As shown in Figure 2, the device 20 for establishing a forum thread database comprises:

A raw web page acquisition unit 201, configured to obtain an unprocessed raw web page;

A forum thread template recognition unit 202, configured to identify, using the preset forum thread template library 14, the forum thread template corresponding to the raw web page;

An information extraction unit 203, configured to extract from the raw web page the information identified by the forum thread template, the information including the forum identifier;

An original forum thread identifier acquisition unit 204, configured to extract from the raw web page the original forum thread identifier corresponding to the raw web page;

An entry lookup unit 205, configured to look up, in the forum thread database 15, the entry corresponding to the forum identifier and the original forum thread identifier;

An information storage unit 206, configured to save the information in the entry corresponding to the forum identifier and the original forum thread identifier.

In this embodiment, the added original forum thread identifier acquisition unit obtains the original forum thread identifier corresponding to the raw web page. With that identifier, the extracted raw web page information can be saved to the entry of its corresponding forum thread. When a forum has multiple forum threads, each thread can therefore be handled separately, and at query time the corresponding information can be found by the forum thread identifier alone, which improves the processing efficiency of the system.

In practice, the entry corresponding to an original forum thread identifier may not exist. In that case an entry creation unit needs to be added to the device for establishing a forum thread database, configured to create in the forum thread database a new entry corresponding to the original forum thread identifier. Further, if the forum thread database has no entry corresponding to a given forum identifier, a new entry corresponding to that forum identifier can also be created in the forum thread database.

The device 31 for establishing an index database provided by the embodiments of the invention is shown in Figure 3 and comprises:

A forum thread acquisition unit 311, configured to acquire from the forum thread database 15 the forum thread corresponding to a forum thread identifier.

The forum thread acquisition unit sends a message requesting forum threads to the forum thread database. On receiving the message, the forum thread database returns to the forum thread acquisition unit the information of forum threads that have not been indexed, or that have been indexed but have been updated since. The number of forum threads returned can be set as needed.

The forum thread database can be established by the device for establishing a forum thread database described in Figure 1.

A keyword set acquisition unit 312, configured to preprocess the forum thread and obtain the keyword set representing the forum thread corresponding to the forum thread identifier.

Preprocessing includes, but is not limited to, word segmentation and/or filtering. Word segmentation is performed in order to remove meaningless words such as "的" ("of"). Some sensitive words are not permitted by law or other regulations, so filtering is also needed. This yields the keywords that best represent the forum thread. These operations are applied above all to the post content.
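As a rough illustration of the preprocessing step, the sketch below derives a keyword set from the post texts of one thread. It is simplified on purpose: it splits on word characters, which suits space-delimited text, whereas Chinese text would need a real word segmenter. The `STOP_WORDS` and `SENSITIVE_WORDS` lists are hypothetical placeholders.

```python
import re

STOP_WORDS = {"the", "a", "of", "is", "to"}   # hypothetical meaningless words
SENSITIVE_WORDS = {"forbiddenword"}           # hypothetical disallowed words


def keyword_set(posts):
    """Return the keywords of a thread given as a list of post texts."""
    keywords = set()
    for text in posts:
        for token in re.findall(r"\w+", text.lower()):
            if token not in STOP_WORDS and token not in SENSITIVE_WORDS:
                keywords.add(token)
    return keywords
```

The function takes all posts of the thread together, so the resulting keyword set describes the thread as a whole rather than any single web page.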

An information storage unit 313, configured to save the forum thread and the keyword set to the index database 32.

By segmenting and filtering the information of the raw web pages, keywords that identify the content of a forum thread are obtained. When web page search is provided to users, the corresponding forum thread can then be found from the keywords, so the web pages of one post are not handled separately, which makes search results more accurate when forum threads are the search target.

In practice, to make the information stored in the index database more complete and thus provide more information for web page search, the following can be added to the device for establishing an index database:

a co-occurrence frequency statistics unit for counting the co-occurrence frequency of the keywords in the keyword set, and/or a single-text term frequency statistics unit for counting the single-text term frequency of the keywords in the keyword set; the information storage unit correspondingly saves the co-occurrence frequency and/or the single-text term frequency in the index database.

The co-occurrence frequency concerns how a keyword is distributed within a forum thread: it counts the keyword's occurrences across multiple posts. A simple way of counting it is this: for each post, if the keyword appears in it, the post counts as 1 no matter how many times the keyword appears. If a keyword appears in five of the posts, its co-occurrence frequency is 5, even if it appears three times in each post. This is only the simplest way of counting. In practice, different weights can be set according to where and how often the keyword appears: for example, an occurrence in the topic post has a higher weight than an occurrence in a reply post, and the more often the keyword appears in a forum thread, the higher the weight.
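The simple counting rule above, and one possible weighted variant, can be written as follows. The weights `topic_weight` and `reply_weight`, and the convention that the first post in the list is the topic post, are assumptions made for the sake of the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (simplified segmentation)."""
    return re.findall(r"\w+", text.lower())


def cooccurrence_frequency(posts):
    """Count, for each word, the number of posts that contain it at least once."""
    counts = Counter()
    for text in posts:
        counts.update(set(tokenize(text)))  # each post contributes at most 1
    return counts


def weighted_frequency(posts, topic_weight=2.0, reply_weight=1.0):
    """Weighted variant: posts[0] is assumed to be the topic post and counts more."""
    scores = Counter()
    for i, text in enumerate(posts):
        weight = topic_weight if i == 0 else reply_weight
        for word in set(tokenize(text)):
            scores[word] += weight
    return scores
```

With five posts that each mention a word three times, `cooccurrence_frequency` gives 5 for that word, matching the worked example in the text.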

Storing the co-occurrence frequency and/or single-text term frequency of keywords in the index database allows search results to be returned to the user sorted by those values, so that forum threads that better match the user's query term come first; the user can thus obtain the desired content more quickly, which meets the user's needs and improves user satisfaction.

An index database provided by an embodiment of the present invention comprises a forum thread forward index table and a forum thread inverted index table; the forum thread forward index table is shown in Table 4:

Table 4. Forum thread forward index table (table not reproduced; see original document, page 18; surviving column headings: single-text term frequency, co-occurrence frequency)

As shown in Table 4, the forum thread forward index table is indexed by forum thread; for each forum thread it records the keyword set, together with the single-text term frequency, co-occurrence frequency, and other information for each keyword in the set;

The forum thread inverted index table is shown in Table 5:

Table 5. Forum thread inverted index table (table not reproduced; see original document, page 18; surviving column headings: single-text term frequency, co-occurrence frequency)

As shown in Table 5, the forum thread inverted index table is indexed by keyword; for each keyword it records which forum threads contain the keyword, together with the keyword's single-text term frequency, co-occurrence frequency, and other information within each of those threads;

Tables 4 and 5 describe only one way of implementing the index database; in practice only one of the two tables may be needed, or more tables may be constructed.
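The two tables can be pictured as nested mappings. The sketch below is one possible in-memory layout, assuming string thread identifiers; the class name `KeywordStats`, the function `build_inverted`, and the sample values are hypothetical and only illustrate how the inverted table can be derived from the forward table.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict


@dataclass
class KeywordStats:
    tf: float           # single-text term frequency of the keyword in the thread
    cooccurrence: int   # number of posts in the thread that contain the keyword


# Forward index: thread id -> {keyword -> stats}
ForwardIndex = Dict[str, Dict[str, KeywordStats]]
# Inverted index: keyword -> {thread id -> stats}
InvertedIndex = Dict[str, Dict[str, KeywordStats]]


def build_inverted(forward: ForwardIndex) -> InvertedIndex:
    """Re-key the forward index by keyword instead of by thread."""
    inverted: InvertedIndex = defaultdict(dict)
    for thread_id, keywords in forward.items():
        for keyword, stats in keywords.items():
            inverted[keyword][thread_id] = stats
    return dict(inverted)


# Toy data, not taken from the source.
forward = {
    "thread-1": {"router": KeywordStats(0.10, 3), "firmware": KeywordStats(0.05, 1)},
    "thread-2": {"router": KeywordStats(0.02, 1)},
}
inverted = build_inverted(forward)
print(sorted(inverted["router"]))
```

A search by keyword reads the inverted mapping directly, while the forward mapping is convenient when a thread is updated or removed.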

The present invention further provides Embodiment 1 of a method for searching web pages, as shown in Figure 4, comprising:

Step 401: obtain the user query term;

When a user wants to look something up, the corresponding query term can be entered through the interface provided by the search engine;

Step 402: look up, in the index database, the forum threads corresponding to the user query term;

The index database may be built by the process described in Figure 2;

Once the user query term has been obtained, it can be used as a keyword to look up the corresponding forum threads in the index database;

Further, in practice the query term entered by the user may not meet the requirements for a keyword, so word segmentation and/or filtering is applied to it before the index database is searched. Word segmentation removes words that carry no meaning, such as "的" ("of"), and yields words in the same form as the stored keywords, which makes the search more accurate; some sensitive words are not permitted by law or other regulations, so the user query term also needs to be filtered;
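The preprocessing described above might look like the following sketch. It is only an illustration under stated assumptions: splitting on whitespace stands in for a real word segmenter (Chinese text would need a dictionary-based or statistical one), and the stop-word and blocked-word lists are hypothetical placeholders.

```python
from typing import List, Set

STOP_WORDS: Set[str] = {"the", "of", "a"}       # hypothetical list of meaningless words
BLOCKED_WORDS: Set[str] = {"forbiddenword"}     # hypothetical list of disallowed words


def preprocess_query(query: str) -> List[str]:
    """Segment the query, then drop stop words, blocked words and duplicates."""
    keywords: List[str] = []
    for token in query.lower().split():   # whitespace split: a stand-in segmenter
        if token in STOP_WORDS or token in BLOCKED_WORDS:
            continue
        if token not in keywords:
            keywords.append(token)
    return keywords


print(preprocess_query("The firmware of the router"))
```

Applying the same segmentation and filtering to queries as to thread content is what lets query keywords match the stored keywords exactly.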

Step 403: format the forum threads found, and output the formatted forum threads;

So that the user can readily understand each forum thread in the search results, the forum threads are formatted to some extent, for example by displaying part of the post content and highlighting the keywords in it, so that the user can learn the relevant content without opening the corresponding web page link and can find the desired content as quickly as possible;
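As a rough illustration of such formatting, the sketch below builds a two-line result with a snippet and highlighted keywords. The square-bracket highlight marker, the 80-character snippet length, and the function names are assumptions; a real result page would more likely emit HTML markup.

```python
import re
from typing import List


def highlight(text: str, keywords: List[str]) -> str:
    """Wrap every case-insensitive keyword occurrence in square brackets."""
    keywords = [k for k in keywords if k]
    if not keywords:
        return text
    # Longest keywords first so that a longer keyword wins over its prefix.
    pattern = "|".join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))
    return re.sub(pattern, lambda m: "[" + m.group(0) + "]", text, flags=re.IGNORECASE)


def format_thread(title: str, posts: List[str], keywords: List[str],
                  snippet_len: int = 80) -> str:
    """Show the title plus the start of the first post that mentions a keyword."""
    lowered = [k.lower() for k in keywords]
    default = posts[0] if posts else ""
    hit = next((p for p in posts if any(k in p.lower() for k in lowered)), default)
    return highlight(title, keywords) + "\n  " + highlight(hit[:snippet_len], keywords)


print(format_thread("Router keeps rebooting",
                    ["Anyone seen this?", "Try new router firmware."],
                    ["router"]))
```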

With the technical solution of this embodiment, forum threads corresponding to the user's query term are returned to the user, so the user obtains query results in units of forum threads; the multiple web pages of one forum thread are not processed separately, which makes the query results returned to the user more accurate.

The present invention further provides Embodiment 2 of a method for searching web pages, as shown in Figure 5, comprising:

Step 501: obtain the user query term;

Step 502: preprocess the user query term to obtain query keywords;

Step 503: look up, in the index database, the forum threads corresponding to the query keywords, and obtain the ranking information of the query keywords;

Step 504: format the forum threads found, and output the formatted forum threads sorted according to the ranking information;

In practice, the ranking information may be the co-occurrence frequency, the single-text term frequency, other information such as link quality or user click counts, or any combination of these. If only one kind of information is used, the threads can be sorted directly by its value or by a value derived from it; if a combination is used, a value can be computed by a preset algorithm and the threads sorted by the computed value. Sorting the forum threads makes it easier for the user to take in the search results;

For example, if only the single-text term frequency is obtained, the corresponding inverse document frequency must also be computed, and the ratio of the single-text term frequency to the inverse document frequency is then used as the basis for sorting. This ratio is widely used in existing web search technology; it represents how much weight a keyword appearing on a web page carries in that page's content: the higher the value, the greater the keyword's weight and the better it represents the page's content. The single-text term frequency (TF: Term Frequency) is the number of times the keyword appears on a web page divided by the total number of words on that page. The inverse document frequency (IDF: Inverse Document Frequency) is the "inverse document frequency index": if a keyword w appears in Dw web pages, then the larger Dw is, the smaller the weight of w, and vice versa; it is computed as log(D/Dw), where D is the total number of web pages;
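The two quantities defined above can be computed as in the following sketch, which assumes each page has already been segmented into a list of words and uses the natural logarithm (the text does not state a base). The function names are hypothetical. The text describes the combined score as the ratio TF/IDF, whereas the conventional TF-IDF weight in information retrieval is the product TF × IDF, so the sketch computes the two values separately and leaves the combination to the caller.

```python
import math
from typing import List


def term_frequency(keyword: str, tokens: List[str]) -> float:
    """Occurrences of the keyword divided by the total number of words on the page."""
    return tokens.count(keyword) / len(tokens) if tokens else 0.0


def inverse_document_frequency(keyword: str, documents: List[List[str]]) -> float:
    """log(D / Dw): D pages in total, Dw of them containing the keyword."""
    containing = sum(1 for tokens in documents if keyword in tokens)
    if containing == 0:
        return 0.0   # assumption: an unseen keyword contributes nothing
    return math.log(len(documents) / containing)


# Toy corpus of four segmented pages, not taken from the source.
pages = [
    ["router", "reboot", "router", "fix"],
    ["router", "price"],
    ["weather", "today"],
    ["holiday", "plans"],
]
print(term_frequency("router", pages[0]))
print(inverse_document_frequency("router", pages))
```

A keyword that appears on every page gets an IDF of log(1) = 0, which is one reason a literal division by IDF needs care.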

If only the co-occurrence frequency is obtained, the threads can be sorted directly by its value;

If the co-occurrence frequency is obtained together with the TF, the TF is first processed to obtain the TF/IDF value, and the TF/IDF value and the co-occurrence frequency are then combined into a relevance value that expresses how closely the keyword relates to the content of the forum thread. One feasible method is a weighted calculation: assuming the weight of TF/IDF is w1 and the weight of the co-occurrence frequency is w2 (w1 + w2 = 1), the relevance value can be computed as w1 × (TF/IDF) + w2 × (co-occurrence frequency);
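The weighted combination can be written in a few lines. In this sketch the TF/IDF value is taken as an already computed input, the function name is hypothetical, and the default weight of 0.6 is an arbitrary illustrative choice rather than a value from the source.

```python
def relevance(tf_idf: float, cooccurrence: float, w1: float = 0.6) -> float:
    """Return w1 * (TF/IDF value) + w2 * (co-occurrence frequency), with w2 = 1 - w1."""
    if not 0.0 <= w1 <= 1.0:
        raise ValueError("w1 must lie in [0, 1] so that w1 + w2 = 1")
    w2 = 1.0 - w1
    return w1 * tf_idf + w2 * cooccurrence


# Equal weights: 0.5 * 0.5 + 0.5 * 3
print(relevance(0.5, 3, w1=0.5))
```

Setting w1 to 1 or 0 reduces the score to ranking by the TF/IDF value alone or by the co-occurrence frequency alone, the two single-signal cases the text also describes.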

Each forum thread has a co-occurrence frequency for the corresponding keyword, and this frequency reflects to some extent how relevant the forum thread is to the keyword. Sorting the forum threads by co-occurrence frequency, with the most relevant first, therefore lets the user find the desired information more quickly. When several forum threads have the same relevance, they may be ordered randomly, ordered by their sequence in the thread database, or ordered by some other method;
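One of the tie-breaking options mentioned above, keeping database order among equally relevant threads, falls out of a stable sort. The sketch assumes the input list is already in database order; the function name and sample scores are hypothetical.

```python
from typing import List, Tuple


def rank_threads(scored: List[Tuple[str, float]]) -> List[str]:
    """Sort thread ids by descending score.

    Python's sorted() is stable, so threads with equal scores keep their
    input order, which is assumed here to be their order in the database.
    """
    return [thread_id for thread_id, _ in sorted(scored, key=lambda item: -item[1])]


print(rank_threads([("t1", 2.0), ("t2", 5.0), ("t3", 2.0)]))
```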

Similarly, if the ranking information obtained includes not only the TF and the co-occurrence frequency but also information such as link quality and user click counts, a weight can be assigned to each kind of ranking information and the relevance value computed with a corresponding algorithm;

In the technical solution of this embodiment, the forum threads are further sorted by their relevance to the user query term, so that the threads that best match the query term are ranked first; the user can find the desired information as quickly as possible, which improves user satisfaction.

To describe the implementation of the technical solution more clearly, an embodiment of the present invention further provides Embodiment 3 of a method for searching web pages. This embodiment describes the entire flow, from obtaining the original web page to outputting the web search results, as shown in Figure 6, comprising:

Step 601: obtain an unprocessed original web page;

Step 602: use a preset forum thread template library to identify the forum thread template corresponding to the original web page;

Step 603: extract from the original web page the forum thread identified by the corresponding forum thread template; in practice, the extracted forum thread information may be saved to the forum thread database;

Step 604: perform word segmentation and filtering on the forum thread to obtain a keyword set representing the forum thread;

Step 605: compute the TF and co-occurrence frequency of the keywords in the keyword set;

Step 606: save the forum thread, the keywords in the keyword set, and the TF and co-occurrence frequency of the keywords to the index database;

Step 607: obtain the user query term;

Step 608: perform word segmentation and filtering on the user query term to obtain query keywords;

Step 609: look up, in the index database, the forum threads corresponding to the query keywords;

Step 610: format the forum threads found;

Step 611: obtain the TF and co-occurrence frequency of the query keywords from the index database;

Step 612: compute the IDF of the query keywords, calculate TF/IDF, and use TF/IDF and the co-occurrence frequency to compute the relevance value between the query keywords and each forum thread;

The IDF is computed by counting how many forum threads in the entire current index database contain the query keyword;

Step 613: output the formatted forum threads sorted by relevance value;

With this embodiment, after an original web page is obtained, its corresponding forum thread is determined, the relevant information is extracted, a keyword set representing the forum thread is obtained, and the TF and co-occurrence frequency of the keywords in the set are computed. When a user's query keyword corresponds to a keyword in the keyword set, the forum thread is determined to meet the user's needs. The index database will of course contain many such forum threads, so a relevance value between each forum thread and the user's query keyword is computed from TF/IDF and the co-occurrence frequency, and the forum threads are output sorted by relevance value. The user thus obtains the forum threads related to the query keyword, with higher relevance values ranked first, and can find the desired information as quickly as possible, which improves user satisfaction.
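The query side of this flow (steps 609 to 613) can be sketched end to end over a toy in-memory inverted index. Several choices here are assumptions rather than details from the source: the index layout and its sample values, the total thread count, summing the per-keyword scores when a query has several keywords, and combining TF and IDF as the conventional product TF × IDF where the text writes "TF/IDF".

```python
import math
from typing import Dict, List, Tuple

# keyword -> {thread id: (TF, co-occurrence frequency)}; toy data, not from the source
INDEX: Dict[str, Dict[str, Tuple[float, int]]] = {
    "router": {"thread-1": (0.10, 3), "thread-2": (0.02, 1)},
    "firmware": {"thread-1": (0.05, 1)},
}
TOTAL_THREADS = 4   # assumed number of threads in the whole index database


def search(keywords: List[str], w1: float = 0.5) -> List[str]:
    """Look up each keyword, score the matching threads, return ids best-first."""
    scores: Dict[str, float] = {}
    for keyword in keywords:
        postings = INDEX.get(keyword, {})
        if not postings:
            continue
        # IDF from the number of threads in the index that contain the keyword.
        idf = math.log(TOTAL_THREADS / len(postings))
        for thread_id, (tf, cooccurrence) in postings.items():
            # Assumption: TF and IDF are combined as a product; scores add up
            # across the keywords of one query.
            value = w1 * tf * idf + (1.0 - w1) * cooccurrence
            scores[thread_id] = scores.get(thread_id, 0.0) + value
    return sorted(scores, key=lambda thread_id: -scores[thread_id])


print(search(["router", "firmware"]))
```

Formatting of each returned thread (step 610) is omitted here; it would be applied to the ranked list before output.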

An embodiment of the present invention provides an apparatus 70 for searching web pages, as shown in Figure 7, comprising:

a user query term obtaining unit 701, configured to obtain the user query term;

a forum thread lookup unit 702, configured to look up, in the index database 32, the forum threads corresponding to the user query term;

a forum thread output unit 703, configured to format the forum threads found and output the formatted forum threads to the user;

With the technical solution of this embodiment, forum threads corresponding to the user's query term are returned to the user, so the user obtains query results in units of forum threads; the multiple web pages of one forum thread are not processed separately, which makes the query results returned to the user more accurate.

Further, in practice the query term entered by the user may not meet the requirements for a keyword, so the embodiment of the apparatus for searching web pages may further include:

a query keyword obtaining unit, which performs word segmentation and filtering on the user query term to obtain query keywords;

The forum thread lookup unit can then look up, in the index database, the forum threads corresponding to the query keywords; because the query keywords are derived from the user query term, the forum threads found also correspond to the user query term;

Further, so that the user can find the desired information as quickly as possible, the output forum threads may be sorted, so the embodiment of the apparatus for searching web pages may also include:

a ranking information obtaining unit, configured to obtain the ranking information of the query keywords in the forum threads;

The ranking information may be the TF and/or the co-occurrence frequency, among others. After the TF, co-occurrence frequency, or other information has been obtained, the forum thread output unit sorts the forum threads by the computed TF/IDF value, the co-occurrence frequency value, or the computed relevance value, and outputs the forum threads to the user in sorted order, so that the threads that best match the user query term are ranked first; the user can find the desired information as quickly as possible, which improves user satisfaction.

The system for searching web pages described in an embodiment of the present invention is shown in Figure 8 and comprises:

an apparatus 801 for building the forum thread database, configured to: obtain an unprocessed original web page; use a preset forum thread template library to identify the forum thread template corresponding to the original web page; extract from the original web page the information identified by the forum thread template, the information including a forum identifier; and save the information in the entry of the forum thread database corresponding to the forum identifier;

an apparatus 802 for building the index database, configured to: obtain, from the forum thread database, the forum thread corresponding to a forum thread identifier; perform word segmentation and filtering on the forum thread to obtain a keyword set representing the forum thread; and save the forum thread and the keyword set to the index database;

an apparatus 803 for searching web pages, configured to: obtain the user query term; look up, in the index database, the forum threads corresponding to the user query term; and format the forum threads found and output the formatted forum threads.

With the technical solution of this embodiment, forum threads corresponding to the user's query term are returned to the user, so the user obtains query results in units of forum threads; the multiple web pages of one forum thread are not processed separately, which makes the query results returned to the user more accurate.

It will be appreciated that the method, apparatus, and system for searching web pages provided by the embodiments of the present invention can be applied to a web search engine, which may be a dedicated forum search engine or a general-purpose search engine. The search engine can then process forum pages in units of forum threads when searching them, which improves the accuracy of the information returned and increases user satisfaction.

A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments can be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs the following steps:

obtaining a user query term;

looking up, in the index database, the forum threads corresponding to the user query term;

formatting the forum threads found, and outputting the formatted forum threads;

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

The method, apparatus, and system for searching web pages and the apparatus for building an index database provided by the embodiments of the present invention have been described in detail above. The description of the embodiments is intended only to aid understanding of the method of the present invention and its underlying idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (25)

1. A method for searching web pages, characterized by comprising: obtaining a user query term; looking up, in a preset index database, the forum threads corresponding to the user query term; and formatting the forum threads found and outputting the formatted forum threads.
2. The method for searching web pages according to claim 1, characterized in that, after the user query term is obtained, the method further comprises: preprocessing the user query term to obtain query keywords; and the looking up, in the preset index database, of the forum threads corresponding to the user query term is specifically: looking up, in the index database according to the query keywords, the forum threads corresponding to the user query term.
3. The method for searching web pages according to claim 2, characterized in that, before the formatted forum threads are output, the method further comprises: obtaining ranking information of the query keywords in the forum threads; and outputting the formatted forum threads sorted according to the ranking information.
4. The method for searching web pages according to claim 3, characterized in that, if the ranking information is the single-text term frequency, the outputting of the formatted forum threads sorted according to the ranking information is specifically: computing the inverse document frequency corresponding to the single-text term frequency; and outputting the formatted forum threads sorted by the ratio of the single-text term frequency to the inverse document frequency.
5. The method for searching web pages according to claim 3, characterized in that, if the ranking information is the co-occurrence frequency, the formatted forum threads are output sorted by the co-occurrence frequency.
6. The method for searching web pages according to claim 3, characterized in that, if the ranking information is the single-text term frequency and the co-occurrence frequency, the outputting of the formatted forum threads sorted according to the ranking information is specifically: computing, from the single-text term frequency and the co-occurrence frequency according to a preset algorithm, a relevance value between the query keywords and the forum threads; and outputting the formatted forum threads sorted by the relevance value.
7. The method for searching web pages according to claim 1, characterized in that the index database is built by the following process: obtaining, from a forum thread database, the forum thread corresponding to a forum thread identifier; preprocessing the forum thread to obtain a keyword set representing the forum thread; and saving the forum thread and the keyword set in association to the index database.
8. The method for searching web pages according to claim 7, characterized by further computing the co-occurrence frequency of the keywords in the keyword set, and further saving the co-occurrence frequency in the index database.
9. The method for searching web pages according to claim 7 or 8, characterized by further computing the single-text term frequency of the keywords in the keyword set, and further saving the single-text term frequency in the index database.
10. The method for searching web pages according to claim 7 or 8, characterized in that the forum thread database is built by the following process: obtaining an unprocessed original web page; using a preset forum thread template library to identify the forum thread template corresponding to the original web page; extracting from the original web page the information identified by the forum thread template, the information including a forum identifier; and saving the information in the entry of the forum thread database corresponding to the forum identifier.
11. The method for searching web pages according to claim 10, characterized by further extracting from the original web page the original forum thread identifier corresponding to the original web page; wherein, before the information is saved in the entry of the forum thread database corresponding to the forum identifier, the method further comprises: looking up, in the forum thread database, the entry corresponding to the forum identifier and the original forum thread identifier, and saving the information in the entry corresponding to the forum identifier and the original forum thread identifier.
12. The method for searching web pages according to claim 11, characterized in that, before the information is saved in the entry corresponding to the forum identifier and the original forum thread identifier, the method further comprises: determining whether an entry corresponding to the original forum thread identifier exists; if so, proceeding to the step of saving the information in the entry corresponding to the forum identifier and the original forum thread identifier; if not, creating in the forum thread database a new entry corresponding to the forum identifier and the original forum thread identifier, and then proceeding to the step of saving the information in the entry corresponding to the forum identifier and the original forum thread identifier.
13. The method for searching web pages according to claim 7 or 8, characterized in that the preprocessing of the forum thread to obtain a keyword set representing the forum thread is specifically: performing word splitting and/or filtering on the forum thread to obtain a keyword set representing the forum thread.
14. An apparatus for building a forum thread database, characterized by comprising: an original web page obtaining unit, configured to obtain an unprocessed original web page; a forum thread template identification unit, configured to use a preset forum thread template library to identify the forum thread template corresponding to the original web page; an information extraction unit, configured to extract from the original web page the information identified by the forum thread template, the information including a forum identifier; and an information storage unit, configured to save the information in the entry of the forum thread database corresponding to the forum identifier.
15. The apparatus for building a forum thread database according to claim 14, characterized by further comprising: an original forum thread identifier obtaining unit, configured to extract from the original web page the original forum thread identifier corresponding to the original web page; and an entry lookup unit, configured to look up, in the forum thread database, the entry corresponding to the forum identifier and the original forum thread identifier; wherein the information storage unit is configured to save the information in the entry corresponding to the forum identifier and the original forum thread identifier.
16. The apparatus for building a forum thread database according to claim 15, characterized in that, if the entry lookup unit does not find the entry corresponding to the forum identifier and the original forum thread identifier, the apparatus further comprises: an entry creation unit, configured to create in the forum thread database a new entry corresponding to the forum identifier and the original forum thread identifier.
17. An apparatus for building an index database, characterized by comprising: a forum thread obtaining unit, configured to obtain, from a forum thread database, the forum thread corresponding to a forum thread identifier; a keyword set obtaining unit, configured to preprocess the forum thread to obtain a keyword set representing the forum thread; and an information storage unit, configured to save the forum thread and the keyword set in association to the index database.
18. The apparatus for building an index database according to claim 17, characterized by further comprising: a co-occurrence frequency statistics unit, configured to compute the co-occurrence frequency of the keywords in the keyword set; wherein the information storage unit is further configured to save the co-occurrence frequency to the index database.
19. The apparatus for building an index database according to claim 17 or 18, characterized by further comprising: a single-text term frequency statistics unit, configured to compute the single-text term frequency of the keywords in the keyword set; wherein the information storage unit is further configured to save the single-text term frequency to the index database.
20、 一种搜索网页的装置,其特征在于,包括: 用户查询词获取单元,用于获取用户查询词;论坛线索查找单元,用于从索引数据库中查找与所述用户查询词对应的论坛线索;论坛线索输出单元,用于对查询到的所述论坛线索进行格式化处理,将格式化处理后的论坛线索输出给用户。 20, an apparatus for searching the Web, characterized by comprising: a user query word acquiring unit, configured to obtain the user query word;'ve clues searching unit configured to search query term corresponding to the user from the forum database index clues ; forum leads an output unit configured to query the forum for formatting cues, the cue to the Forum formatted output to the user.
21、 如权利要求20所述的搜索网页的装置,其特征在于,进一步包括: 查询关键字获取单元,用于对所述用户查询词进行预处理,获得查询关键字;所述论坛线索查找单元,用于根据所述查询关键字从索引数据库中查找与所述用户查询词对应的论坛线索。 21. The apparatus according to claim 20 searches web, characterized in that, further comprising: a query key acquiring unit, configured to preprocess the user query word to obtain a query key; the forum cue searching unit for search and query the user forum clue word from the corresponding index database according to the query keywords.
22、 如权利要求21所述的搜索网页的装置,其特征在于,进一步包括:排序信息获取单元,用于获取所述论坛线索中所述查询关键字的单文本词汇频率;计算单元,用于采用统计得到的与所述单文本词汇频率对应的逆文本频率,计算所述单文本词汇频率与逆文本频率的比值;所述论坛线索输出单元,用于按照所述单文本词汇频率与逆文本频率的比值排序输出所述格式化处理后的论坛线索。 22. The apparatus of claim 21, search the web, characterized in that, further comprising: sorting information acquisition unit for acquiring clues in the forum of the single text query key word frequency; calculating means for inverse document frequency of the use of single word text corresponding to the frequency of the counted, calculating the ratio of the single text word frequency and inverse document frequency; the forum cue output unit according to the inverse document frequency of the word of the single text the ratio of the output frequency sort formatting cues processed forum.
23. The apparatus for searching web pages according to claim 21, further comprising: a ranking information acquiring unit, configured to acquire the co-occurrence frequency of the query keywords in the forum threads; wherein the forum thread output unit is configured to output the formatted forum threads sorted by the co-occurrence frequency.
24. The apparatus for searching web pages according to claim 21, further comprising: a ranking information acquiring unit, configured to acquire the single-text term frequency and the co-occurrence frequency of the query keywords in the forum threads; and a relevance value calculating unit, configured to calculate, using a preset algorithm and according to the single-text term frequency and the co-occurrence frequency, a relevance value between the query keywords and each forum thread; wherein the forum thread output unit is configured to output the formatted forum threads sorted by the relevance value.
25. A system for searching web pages, comprising: an apparatus for establishing a forum thread database, configured to obtain an unprocessed original web page; identify, using a preset forum thread template library, the forum thread template corresponding to the original web page; extract from the original web page the information identified by the forum thread template, the information including a forum identifier; and store the information in the entry of the forum thread database that corresponds to the forum identifier; an apparatus for establishing an index database, configured to obtain from the forum thread database the forum thread corresponding to a forum thread identifier; preprocess the forum thread to obtain a keyword set representing the forum thread; and store the forum thread and the keyword set in correspondence in the index database; and an apparatus for searching web pages, configured to obtain a user query word; search the index database for forum threads corresponding to the user query word; format the forum threads found; and output the formatted forum threads.
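
The indexing and search apparatuses of claims 19 to 21 and 25 can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `ThreadIndex`, `extract_keywords`, `add_thread` and `search` are hypothetical, the preprocessing is a plain lowercase split standing in for whatever word segmentation the apparatus actually uses, and the ranking simply sums the stored single-text term frequencies. It does not implement the ratio of claim 22, the co-occurrence frequency of claim 23, or the preset algorithm of claim 24.

```python
import re
from collections import Counter, defaultdict


def extract_keywords(text):
    """Assumed stand-in for preprocessing: lowercase, split on non-alphanumerics."""
    return [w for w in re.split(r"[^a-z0-9]+", text.lower()) if w]


class ThreadIndex:
    """Hypothetical index database keyed by forum thread, not by web page."""

    def __init__(self):
        self.threads = {}                   # thread_id -> thread text
        self.postings = defaultdict(dict)   # keyword -> {thread_id: term frequency}

    def add_thread(self, thread_id, text):
        # Store the thread together with its keyword set and the
        # per-thread (single-text) term frequency of each keyword.
        self.threads[thread_id] = text
        for word, tf in Counter(extract_keywords(text)).items():
            self.postings[word][thread_id] = tf

    def search(self, query):
        # Preprocess the query into keywords, look up matching threads,
        # rank by summed term frequency (a simplification), then format.
        scores = Counter()
        for word in extract_keywords(query):
            for thread_id, tf in self.postings.get(word, {}).items():
                scores[thread_id] += tf
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return [f"[{tid}] {self.threads[tid]}" for tid, _ in ranked]
```

Because each entry in `postings` points at a thread identifier, every result is one whole forum thread rather than one of the web pages it happens to span, which is the unit of retrieval the claims describe.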

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200710136345 CN100478962C (en) 2007-07-24 2007-07-24 Method, device and system for searching web page and device for establishing index database

Publications (2)

Publication Number Publication Date
CN101101605A (en) 2008-01-09
CN100478962C (en) 2009-04-15

Family

ID=39035877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200710136345 CN100478962C (en) 2007-07-24 2007-07-24 Method, device and system for searching web page and device for establishing index database

Country Status (1)

Country Link
CN (1) CN100478962C (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101639831B (en) 2008-07-29 2012-09-05 华为技术有限公司 Search method, search device and search system
WO2010014954A2 (en) * 2008-08-01 2010-02-04 Google Inc. Providing posts to discussion threads in response to a search query
CN102737042B (en) * 2011-04-08 2015-03-25 北京百度网讯科技有限公司 Method and device for establishing question generation model, and question generation method and device
CN102317943B (en) * 2011-07-29 2013-10-02 华为技术有限公司 Method and device for full-text search
CN102831186A (en) * 2012-08-02 2012-12-19 深圳市同洲电子股份有限公司 Method and device for storing and searching webpage
CN103581280B (en) * 2012-08-30 2017-02-15 网易传媒科技(北京)有限公司 Method and device for interface interaction based on micro blog platform
WO2014132265A2 (en) * 2013-02-14 2014-09-04 Gyan Prakash Kesarwani An improved system and method of scanning a search engine depending on the importance of the keywords and producing an effective output
CN104951449A (en) * 2014-03-26 2015-09-30 腾讯科技(深圳)有限公司 Data processing method and device
CN105912545A (en) * 2015-12-15 2016-08-31 乐视网信息技术(北京)股份有限公司 Device, method, and system for media resource retrieval

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1471030A (en) 2002-06-28 2004-01-28 微软公司 System and method of automatic example sentence search based on weighted editing distance
CN1227611C (en) 2001-03-09 2005-11-16 北京大学 Method for judging position correlation of a group of query keys or words on network page

Also Published As

Publication number Publication date
CN101101605A (en) 2008-01-09

Similar Documents

Publication Publication Date Title
CA2813644C (en) Phrase-based searching in an information retrieval system
Glance et al. Blogpulse: Automated trend discovery for weblogs
US8495049B2 (en) System and method for extracting content for submission to a search engine
US7577643B2 (en) Key phrase extraction from query logs
US7536408B2 (en) Phrase-based indexing in an information retrieval system
US6970863B2 (en) Front-end weight factor search criteria
US7636714B1 (en) Determining query term synonyms within query context
US7930286B2 (en) Federated searches implemented across multiple search engines
US7580929B2 (en) Phrase-based personalization of searches in an information retrieval system
Becker et al. Identifying content for planned events across social media sites
US7711679B2 (en) Phrase-based detection of duplicate documents in an information retrieval system
US20060018551A1 (en) Phrase identification in an information retrieval system
JP5575902B2 (en) Information retrieval based on query semantic patterns
US20100268720A1 (en) Automatic mapping of a location identifier pattern of an object to a semantic type using object metadata
CN100416570C (en) FAQ based Chinese natural language ask and answer method
KR101450358B1 (en) Searching structured geographical data
KR101176079B1 (en) Phrase-based generation of document descriptions
JP2008508575A (en) Aggregation and search methods using ecosystems and related technologies
US20130246440A1 (en) Processing a content item with regard to an event and a location
KR101527259B1 (en) Providing posts to discussion threads in response to a search query
US20130332460A1 (en) Structured and Social Data Aggregator
US20030135430A1 (en) Method and apparatus for classification
US8756245B2 (en) Systems and methods for answering user questions
US8554854B2 (en) Systems and methods for identifying terms relevant to web pages using social network messages
Li et al. Tag-based social interest discovery

Legal Events

Date Code Title Description
C06 Publication
C10 Request of examination as to substance
C14 Granted
ASS Succession or assignment of patent right
Owner name: BEIJING JINGDONG SHANGKE INFORMATION TECHNOLOGY CO
Free format text: FORMER OWNER: HUAWEI TECHNOLOGY CO., LTD.
Effective date: 20150619

C41 Transfer of the right of patent application or the patent right