CN101246501B - Method and system for polymerizing the same subject network document files - Google Patents

Info

Publication number
CN101246501B
Authority
CN
Grant status
Grant
Patent type
Prior art keywords
document
network
same
word
words
Prior art date
Application number
CN 200810088055
Other languages
Chinese (zh)
Other versions
CN101246501A (en)
Inventor
唐年鹏
王志平
Original Assignee
腾讯科技(深圳)有限公司

Abstract

The invention relates to a method for aggregating web documents on the same topic, comprising: obtaining the weight of each word in a current web document; selecting, in order, two or more higher-weight words to compose a query term; searching for same-topic web documents with the composed query term until the number of same-topic web documents retrieved by some query term exceeds a preset value; and aggregating the current web document with the same-topic web documents. The invention also discloses a system for aggregating web documents on the same topic. The invention addresses the prior-art problem that the large amount of data to be processed when aggregating same-topic web documents slows web updates and degrades the user experience. The invention can increase web update speed and improve the user experience.

Description

A method and system for aggregating web documents on the same topic

Technical Field

[0001] The present invention relates to the field of web document aggregation, and in particular to a method and system for aggregating web documents on the same topic.

Background

[0002] On the web, aggregating web documents on the same topic and presenting them to users, so that users can gain a comprehensive and detailed understanding of the topic, is an important part of web services. In the prior art, many websites rely mainly on editors to collate same-topic web documents by hand. Manpower is limited, however, and given the enormous volume of web resources, human editors clearly cannot collate same-topic web documents comprehensively or in a timely manner. At present, some large websites use traditional classification and clustering methods to aggregate web documents on the same topic.

[0003] Referring to Figure 1, an existing method for aggregating web documents on the same topic is shown, which comprises the following steps.

[0004] Step S101: classify web documents by the category of their topic, and set up a keyword library for each category of web documents. The keywords in a keyword library collectively reflect the characteristics of that category. For example, for web documents whose topic is a certain celebrity, the keyword library includes words such as the celebrity's name, the titles of their main songs, and the titles of films they starred in.

[0005] Step S102: for a newly found web document, extract all the words in that web document to form a keyword library.

[0006] Step S103: match the keyword library of the newly found web document against the keyword library of each category, and select the category with the highest word-matching degree; the newly found web document has the same topic as the web documents of that category. For example, if the newly found web document is a report on the "911" incident, its keyword library includes words such as "September 11", "terrorists", "aircraft", and "World Trade Center". The keyword library of the "911" incident category also contains these words, so the word-matching degree between the two keyword libraries is relatively high.

[0007] Step S104: aggregate the newly found web document into that category of web documents.

[0008] Although the above method can aggregate a newly found web document with web documents on the same topic reasonably well, every retrieved web document must be compiled into a keyword library and then matched against the keyword library of each category. Web documents generally need to be subdivided into many categories, so the amount of data to be processed is too large, which slows web updates and degrades the user experience.

[0009] Moreover, the above method makes its determination mainly on the basis of the keywords in the keyword libraries. If keywords are poorly chosen, or if the keyword libraries of web documents on similar topics share most of their keywords, misjudgments arise easily, same-topic web documents cannot be aggregated accurately, and the user experience suffers.

Summary

[0010] The technical problem to be solved by the present invention is to provide a method for aggregating web documents on the same topic, so as to solve the prior-art problem that aggregating same-topic web documents requires processing too much data, which slows web updates and degrades the user experience. The method can increase web update speed and improve the user experience.

[0011] Another object of the present invention is to provide a system for aggregating web documents on the same topic; the system can increase web update speed and improve the user experience.

[0012] A method of the present invention for aggregating web documents on the same topic comprises: obtaining the weight of each word in a current web document and sorting the words in descending order of weight; starting from the first word, successively combining each word with the next word adjacent to it to form a query term; searching for same-topic web documents with the composed query terms until the number of same-topic web documents retrieved by some query term exceeds a preset value; and aggregating the current web document with the same-topic web documents.

[0013] Preferably, before aggregating the current web document with the same-topic web documents, the method further comprises: using a hash table to represent the vector values of the words in the current web document and in the same-topic web documents; computing, from the vector values of the words, the relevance of each same-topic web document to the current web document; and removing the same-topic web documents whose relevance is below a preset value.

[0014] Preferably, computing the relevance of a same-topic web document to the current web document from the vector values of the words specifically comprises: sorting the words of the current web document and of the same-topic web document in ascending order of occurrence frequency; multiplying the vector value of each word in the same-topic web document by the vector value of the corresponding word in the current web document, and summing the products to obtain first data; squaring the vector values of the words in the same-topic web document and summing the squares; squaring the vector values of the words in the current web document and summing the squares; multiplying the two sums and taking the square root of the product to obtain second data; and taking the quotient of the first data divided by the second data as the relevance of the same-topic web document to the current web document.

[0015] Preferably, obtaining the weight of each word in the current web document specifically comprises: counting the occurrence frequency of each word in the current web document; obtaining the number of indexed documents hit by each word and the total number of indexed documents; dividing the total number of indexed documents by the number of indexed documents hit by the word; taking the logarithm of the result; and multiplying the resulting value by the occurrence frequency to obtain the weight of the word.

[0016] Preferably, counting the occurrence frequency of each word in the current web document specifically comprises: obtaining the positions at which the word appears in the current web document and the number of occurrences at each position; multiplying the number of occurrences at each position by the coefficient corresponding to that position; and summing the products to obtain the occurrence frequency of the word.

[0017] Preferably, counting the occurrence frequency of each word in the current web document specifically comprises: counting the number of times the word appears in the current web document; determining whether the word appears in the topic position of the web document; and, if so, adding a set value to the word's total number of occurrences to obtain its occurrence frequency.

[0018] A system of the present invention for aggregating web documents on the same topic comprises a weight calculation module, a query-term composition module, a web document retrieval module, and an aggregation module. The weight calculation module is configured to obtain the weight of each word in a current web document. The query-term composition module is configured to sort the words in descending order of weight and, starting from the first word, successively combine each word with the next word adjacent to it to form a query term. The web document retrieval module is configured to search for same-topic web documents with the composed query terms until the number of same-topic web documents retrieved by some query term exceeds a preset value. The aggregation module is configured to aggregate the current web document with the same-topic web documents.

[0019] Preferably, the system further comprises a vector value module, a relevance calculation module, and a removal module. The vector value module is configured to use a hash table to represent the vector values of the words in the current web document and in the same-topic web documents. The relevance calculation module is configured to compute, from the vector values of the words, the relevance of each same-topic web document to the current web document. The removal module is configured to remove the same-topic web documents whose relevance is below a preset value.

[0020] Compared with the prior art, the present invention has the following advantages:

[0021] The present invention combines the higher-weight words of the current web document into query terms and uses them to search for same-topic web documents. Because high-weight words are highly representative, they reflect the characteristics of the current web document well. Web documents retrieved with a query term composed of two or more higher-weight words are very likely to share the topic of the current web document. In selecting same-topic web documents, the present invention only needs to choose suitable words to compose query terms for searching. Compared with the prior art shown in Figure 1, the present invention does not need to compare every web document found against the web documents of every topic category one by one, so less data needs to be processed; in application, web updates are fast, which helps improve the user experience.

Brief Description of the Drawings

[0022] Figure 1 is a flowchart of an existing method for aggregating web documents on the same topic;

[0023] Figure 2 is a flowchart of a first embodiment of the method of the present invention for aggregating documents on the same topic;

[0024] Figure 3 is a flowchart of the method of the present invention for computing the weight of each word in the current web document;

[0025] Figure 4 is a flowchart of a second embodiment of the method of the present invention for aggregating related web documents;

[0026] Figure 5 is a schematic diagram of a first embodiment of the system of the present invention for aggregating web documents on the same topic;

[0027] Figure 6 is a schematic structural diagram of the query-term composition module of the present invention;

[0028] Figure 7 is a schematic diagram of a second embodiment of the system of the present invention for aggregating web documents on the same topic.

Detailed Description

[0029] To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.

[0030] The present invention composes query terms from the higher-weight words in the current web document, uses the query terms to retrieve web documents with the same topic as the current web document, and aggregates the retrieved web documents with the current web document. The method of the present invention for aggregating same-topic web documents can be applied in a number of related fields, such as topical news aggregation and topical event aggregation, so that users can read related material in one place.

[0031] Referring to Figure 2, a first embodiment of the method of the present invention for aggregating documents on the same topic is shown; the specific steps are as follows.

[0032] Step S201: obtain the weight of each word in the current web document. Segment the current web document into separate words; remove function words that carry no substantive meaning, such as prepositions, modal particles, and interjections; extract words that carry substantive meaning, such as nouns and verbs; and compute the weight of each extracted word in turn. The weight indicates how relevant the word is to the topic content of the current web document: the higher the relevance, the higher the weight.

[0033] For example, if the current web document is a patent document, the words in it that are closely related to patents have relatively high weights, such as "patent", "application", "invalidation", "examination", and "reexamination".

[0034] Step S202: successively select two or more higher-weight words to compose a query term, and search for same-topic web documents with the composed query term, until the number of same-topic web documents retrieved by some query term exceeds a preset value. The preset value may be greater than 10.

[0035] Search for same-topic web documents with the first query term selected, and determine whether the number of same-topic web documents retrieved exceeds the preset value. If so, stop composing query terms and extract the retrieved web documents; if not, compose another query term and search again, until the number of same-topic web documents retrieved by some query term exceeds the preset value.

[0036] For example, in the patent document above, the two higher-weight words "patent" and "application" are selected to compose the query term "patent application", which is used to search for same-topic web documents. It is then determined whether the number of web documents retrieved exceeds 10. If so, searching stops; if not, further higher-weight words are selected from the patent document to compose a query term, for example "patent" and "invalidation" to compose the query term "patent invalidation", and the search is repeated until some query term retrieves more than 10 web documents.
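The retrieval loop of step S202 can be sketched as follows. This is only an illustrative sketch, not the patent's implementation: the `search` callable stands in for whatever search backend is used, the names `find_same_topic_documents` and `fake_index` are hypothetical, and returning `None` when no query term exceeds the preset value is an assumption, since the text does not say what happens in that case.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def find_same_topic_documents(
    queries: Iterable[str],
    search: Callable[[str], List[str]],
    preset_value: int = 10,
) -> Tuple[Optional[str], List[str]]:
    """Try query terms in order; stop at the first whose hit count exceeds preset_value."""
    for query in queries:
        hits = search(query)
        if len(hits) > preset_value:
            # This query term retrieved enough same-topic documents: stop composing.
            return query, hits
    # Assumption: the source does not define this case, so report "nothing found".
    return None, []


# Hypothetical in-memory stand-in for a real search index.
fake_index = {
    "patent application": [f"doc{i}" for i in range(8)],
    "patent invalidation": [f"doc{i}" for i in range(12)],
}
query, hits = find_same_topic_documents(
    ["patent application", "patent invalidation"],
    lambda q: fake_index.get(q, []),
)
```

With this toy index the first query term retrieves only 8 documents, so the loop moves on to the second, which retrieves 12 and ends the search.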

[0037] The present invention may select two or more higher-weight words to compose a query term in various ways; the aim is for the query term to reflect the topic content of the current web document as fully as possible.

[0038] For example, the words whose weights exceed a set value form a word library, and two or more words are selected at random from the word library to compose a query term.

[0039] As another example, the words are sorted in descending order of weight, and the first word is combined in turn with the second, third, and fourth words to form query terms. For instance, if the sorted words are A, B, C, D, ..., the query terms composed in turn are AB, AC, AD, ....

[0040] As a further example, the words are sorted in descending order of weight and, starting from the first word, each word is successively combined with the next word adjacent to it to form a query term. For instance, if the sorted words are A, B, C, D, ..., the query terms composed in turn are AB, BC, CD, ....
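The two ordered composition strategies (AB, AC, AD, ... and AB, BC, CD, ...) can be sketched as below. The function names are hypothetical, and each query term is represented as a pair of words; how the pair is joined into a search string is left to the search backend.

```python
from typing import Dict, List, Tuple


def rank_words(weights: Dict[str, float]) -> List[str]:
    """Sort words in descending order of weight."""
    return sorted(weights, key=lambda word: weights[word], reverse=True)


def first_with_rest(ranked: List[str]) -> List[Tuple[str, str]]:
    """Pair the first word with each later word: AB, AC, AD, ..."""
    if not ranked:
        return []
    return [(ranked[0], word) for word in ranked[1:]]


def adjacent_pairs(ranked: List[str]) -> List[Tuple[str, str]]:
    """Pair each word with the next adjacent word: AB, BC, CD, ..."""
    return list(zip(ranked, ranked[1:]))


ranked = rank_words({"B": 3.0, "A": 4.0, "D": 1.0, "C": 2.0})
pairs_from_first = first_with_rest(ranked)
pairs_adjacent = adjacent_pairs(ranked)
```

Both functions yield query terms lazily usable by a retrieval loop: the caller can stop consuming pairs as soon as one query term retrieves enough documents.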

[0041] Step S203: aggregate the current web document with the same-topic web documents.

[0042] The present invention combines the higher-weight words of the current web document into query terms and uses them to search for same-topic web documents. Because high-weight words are highly representative, they reflect the characteristics of the current web document well. Web documents retrieved with a query term composed of two or more higher-weight words are very likely to share the topic of the current web document.

[0043] In selecting same-topic web documents, the present invention only needs to choose suitable words to compose query terms for searching. It does not need to compare every web document found against the web documents of every topic category one by one, so less data needs to be processed; in application, web updates are fast, which helps improve the user experience.

[0044] In step S201 above, the present invention may compute the weight of each word in the current web document in various ways. The main idea is to use the occurrence frequency of a word in the current web document, together with how common the word is across web documents, to compute the weight of the word with a suitable formula.

[0045] Referring to Figure 3, the method of the present invention for computing the weight of each word in the current web document is shown; the specific steps are as follows.

[0046] Step S301: count the occurrence frequency of each word in the current web document. The more times a word appears in the current web document, and the more important the positions where it appears, the greater its occurrence frequency. The present invention may count the occurrence frequency of a word in various ways; two preferred ways are described here.

[0047] For example, obtain the positions at which the word appears in the current web document and the number of occurrences at each position, multiply the number of occurrences at each position by the coefficient corresponding to that position, and take the sum of the products as the occurrence frequency of the word. For instance, if a word appears once in the title of the current web document and 15 times in the body, the coefficient of the title position is 8, and the coefficient of the body position is 1, then the occurrence frequency of the word is 1 × 8 + 15 × 1 = 23.

[0048] As another example, count the number of times the word appears in the current web document and determine whether the word appears in the topic position of the web document; if so, add a set value to the word's total number of occurrences to obtain its occurrence frequency. For instance, if a word appears 12 times in total in the current web document, the word appears in the topic position of the web document, and the set value is 10, then the occurrence frequency of the word is 12 + 10 = 22.
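The two counting schemes can be sketched as below, reproducing the worked figures from the text (23 and 22). The function names and the dictionary-based representation of positions are illustrative assumptions, not part of the patent.

```python
from typing import Dict


def positional_frequency(
    counts_by_position: Dict[str, int],
    coefficients: Dict[str, float],
) -> float:
    """Sum of (occurrences at a position) x (coefficient of that position)."""
    return sum(
        count * coefficients[position]
        for position, count in counts_by_position.items()
    )


def boosted_frequency(total_count: int, in_topic_position: bool, bonus: int) -> int:
    """Total occurrences, plus a set bonus if the word appears in the topic position."""
    return total_count + bonus if in_topic_position else total_count


# First scheme: 1 occurrence in the title (coefficient 8), 15 in the body (coefficient 1).
freq_weighted = positional_frequency({"title": 1, "body": 15}, {"title": 8, "body": 1})

# Second scheme: 12 occurrences in total, word appears in the topic position, set value 10.
freq_boosted = boosted_frequency(12, True, 10)
```

Either result can serve as the TF term when the word's weight is computed.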

[0049] Step S302: obtain the number of indexed documents hit by each word and the total number of indexed documents. The web server obtains the total number of indexed documents across web documents by traversal, then searches the full index with the word and counts the number of indexed documents that the word hits.

[0050] Step S303: compute the weight of the word. The weight formula is:

[0051] word weight = TF × lg(N/n);

[0052] where TF is the occurrence frequency of the word, N is the total number of indexed documents, and n is the number of indexed documents hit by the word.

[0053] Of course, the present invention may also use various other weight formulas, for example:

[0054] word weight = TF × K(N/n), where K is a coefficient.

[0055] As another example:

[0056] word weight = TF × (N/n) + Z, where Z is a constant.
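The primary formula, word weight = TF × lg(N/n), can be sketched as follows. This sketch assumes that "lg" denotes the base-10 logarithm, as is conventional in Chinese mathematical notation; the function name and the error handling for words with no index hits are illustrative additions.

```python
import math


def word_weight(tf: float, total_docs: int, docs_with_word: int) -> float:
    """Weight = TF x lg(N / n), with lg assumed to be the base-10 logarithm."""
    if docs_with_word <= 0 or total_docs <= 0:
        # Assumption: the source does not cover words with no index hits.
        raise ValueError("document counts must be positive")
    return tf * math.log10(total_docs / docs_with_word)


# A word with occurrence frequency 23 that hits 1,000 of 1,000,000 indexed documents.
weight = word_weight(23, 1_000_000, 1_000)
```

In this example N/n is 1000, whose base-10 logarithm is 3, so the weight is 23 × 3 = 69. A word that hits every indexed document gets a weight of 0, which matches the idea that very common words say little about a document's topic.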

[0057] From the occurrence frequency of a word in the current web document and how common the word is across web documents, the present invention computes the weight of the word with respect to the current web document. This weight reflects well how representative the word is of the characteristics of the current web document.

[0058] To further ensure that the retrieved same-topic web documents are highly relevant to the current web document, the present invention may filter the retrieved web documents further in various ways, selecting those that are highly relevant to the current web document.

[0059] Referring to Figure 4, a second embodiment of the method of the present invention for aggregating related web documents is shown; the specific steps are as follows.

[0060] Step S401: obtain the weight of each word in the current web document.

[0061] Step S402: successively select two or more higher-weight words to compose a query term, and search for same-topic web documents with the composed query term, until the number of same-topic web documents retrieved by some query term exceeds a preset value.

[0062] Step S403: use a hash table to represent the vector values of the words in the current web document and in the retrieved web documents.

[0063] Step S404: sort the words of the current web document and of the retrieved web documents in ascending order of occurrence frequency.

[0064] Step S405: compute, from the vector values of the words, the relevance of each retrieved web document to the current web document. The formula is:

[0065] relevance = Σ(ai × bi) / √(Σ ai² × Σ bi²);

[0066] where ai denotes the vector value of each word in the current web document, and bi denotes the vector value of each word in the retrieved web document.

[0067] Step S406: remove the same-topic web documents whose relevance is below a preset value. The preset value may be adjusted according to the topic type of the current web document.
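Steps S403 to S406 can be sketched as below, with Python dictionaries playing the role of the hash tables. The relevance is the sum of products of corresponding vector values divided by the square root of the product of the two sums of squares, as the method describes. Several points are assumptions made for illustration: the text does not say what the vector values are (word weights or frequencies would both fit), a word missing from one document is treated as having value 0, and matching words by dictionary key replaces the explicit sorting of step S404. All names are hypothetical.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]  # word -> vector value, stored in a hash table


def relevance(current: Vector, candidate: Vector) -> float:
    """First data (sum of products) divided by second data (root of product of square sums)."""
    first = sum(value * candidate.get(word, 0.0) for word, value in current.items())
    second = math.sqrt(
        sum(v * v for v in current.values()) * sum(v * v for v in candidate.values())
    )
    # Assumption: an empty or all-zero vector is treated as having relevance 0.
    return first / second if second else 0.0


def filter_by_relevance(
    current: Vector, candidates: Dict[str, Vector], preset_value: float
) -> List[str]:
    """Keep documents whose relevance is not below the preset value."""
    return [
        doc_id
        for doc_id, vector in candidates.items()
        if relevance(current, vector) >= preset_value
    ]


current_doc = {"patent": 3.0, "application": 4.0}
retrieved = {
    "doc_a": {"patent": 3.0, "application": 4.0},
    "doc_b": {"football": 5.0},
}
kept = filter_by_relevance(current_doc, retrieved, preset_value=0.5)
```

Here `doc_a` has the same vector as the current document and is kept, while `doc_b` shares no words with it, scores 0, and is removed before aggregation.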

[0068] Step S407: aggregate the current web document with the same-topic web documents.

[0069] The present invention computes the relevance between the current web document and each retrieved web document from word vectors and selects the web documents that are more relevant to the current web document, further improving the precision of aggregating same-topic web documents.

[0070] Based on the above method for aggregating web documents on the same topic, the present invention further provides a system for aggregating web documents on the same topic; the system can increase web update speed and improve the user experience.

[0071] Referring to Figure 5, a first embodiment of the system of the present invention for aggregating web documents on the same topic is shown, comprising a weight calculation module 51, a query-term composition module 52, a web document retrieval module 53, and an aggregation module 54.

[0072] The weight calculation module 51 obtains the weight of each word in the current web document. The weight indicates how relevant the word is to the topic content of the current web document: the higher the relevance, the higher the weight. The weight calculation module 51 sends the obtained weights to the query-term composition module 52.

[0073] The query-term composition module 52 successively selects two or more higher-weight words to compose a query term. The query-term composition module 52 may form a word library from the words whose weights exceed a set value and select two or more words at random from the word library to compose a query term. It may also sort the words in descending order of weight and combine the first word in turn with the second, third, and fourth words to form query terms. It may also sort the words in descending order of weight and, starting from the first word, successively combine each word with the next word adjacent to it to form a query term. The query-term composition module 52 sends the composed query terms to the web document retrieval module 53.

[0074] The web document retrieval module 53 searches for same-topic web documents with the composed query terms until the number of same-topic web documents retrieved by some query term exceeds a preset value. The web document retrieval module 53 searches with the first query term selected and determines whether the number of same-topic web documents retrieved exceeds the preset value. If so, it extracts the retrieved web documents; if not, it obtains another query term and searches again, until the number of same-topic web documents retrieved by some query term exceeds the preset value. The web document retrieval module 53 sends the extracted web documents to the aggregation module 54.

[0075] The aggregation module 54 aggregates the current web document with the retrieved web documents.

[0076] Referring to FIG. 6, the search term composition module 52 of the present invention comprises a word sorting submodule 521 and a composition submodule 522. The word sorting submodule 521 sorts the words in descending order of weight value and sends them to the composition submodule 522. Starting from the first word, the composition submodule 522 combines each word in turn with the adjacent next word to form a search term.

[0077] By means of the relevant modules, the present invention calculates the relevance between each retrieved network document and the current network document and removes the network documents with low relevance, further improving the quality of the aggregated network documents.

[0078] Referring to FIG. 7, a second embodiment of the system of the present invention for aggregating same-subject network documents is shown, comprising the weight value calculation module 51, the search term composition module 52, the network document retrieval module 53, the aggregation module 54, a vector value module 55, a relevance calculation module 56, and a removal module 57.

[0079] The vector value module 55 uses a hash table to represent the vector values of the words in the current network document and in the same-subject network documents, and sends the vector values of the words to the relevance calculation module 56.

[0080] The relevance calculation module 56 calculates, from the vector values of the words, the relevance value between a retrieved network document and the current network document, using the formula:

[0081] $$\cos(\theta) = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^{2} \cdot \sum_{i} b_i^{2}}}$$

[0082] where a_i denotes the vector value of each word in the current network document and b_i denotes the vector value of each word in the retrieved network document. The relevance calculation module 56 sends the relevance value between each retrieved network document and the current network document to the removal module 57.

[0083] The removal module 57 removes the network documents whose relevance value is below a preset value and sends the remaining network documents to the aggregation module 54. The aggregation module 54 aggregates these network documents.
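The cooperation of modules 55 to 57 described in [0079] to [0083] can be sketched as follows. The sketch assumes that a word's vector value is simply its occurrence count, which this passage does not confirm; a Python dictionary (`collections.Counter`) stands in for the hash table, and the function names and the `minimum` parameter are hypothetical.

```python
import math
from collections import Counter


def vector_values(words):
    """Hash table mapping each word to its vector value
    (assumed here to be the raw occurrence count)."""
    return Counter(words)


def relevance(a, b):
    """Sum of products of matching vector values, divided by the square
    root of the product of the two sums of squares."""
    dot = sum(value * b.get(word, 0) for word, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def remove_low_relevance(current_words, retrieved, minimum):
    """Keep only the retrieved documents whose relevance to the current
    document is not below `minimum`."""
    a = vector_values(current_words)
    return [doc_id for doc_id, words in retrieved.items()
            if relevance(a, vector_values(words)) >= minimum]


if __name__ == "__main__":
    current = ["web", "document", "topic"]
    retrieved = {
        "near": ["web", "document", "topic", "topic"],
        "far": ["football", "score"],
    }
    print(remove_low_relevance(current, retrieved, minimum=0.5))
```

Looking words up in a hash table keeps the dot product linear in the number of distinct words of the current document, since words missing from the other document contribute zero.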

[0084] The weight value calculation module 51, the search term composition module 52, and the network document retrieval module 53 have the same functions and effects in this embodiment as in the embodiment shown in FIG. 5 and are not described again here.

[0085] The method and system for aggregating same-subject network documents provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and to the scope of application. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (8)

  1. A method for aggregating same-subject network documents, characterized by comprising: obtaining the weight value of each word in a current network document; sorting the words in descending order of weight value; starting from the first word, combining each word in turn with the adjacent next word to form a search term, and retrieving same-subject network documents with the composed search terms until the number of same-subject network documents retrieved by some search term exceeds a preset value; and aggregating the current network document and the same-subject network documents.
  2. The method of claim 1, characterized in that, before aggregating the current network document and the same-subject network documents, the method further comprises: using a hash table to represent the vector values of the words in the current network document and in the same-subject network documents; calculating, from the vector values of the words, the relevance value between each same-subject network document and the current network document; and removing the same-subject network documents whose relevance value is below a preset value.
  3. The method of claim 2, characterized in that calculating, from the vector values of the words, the relevance value between the same-subject network document and the current network document specifically comprises: sorting the words in the current network document and in the same-subject network document in ascending order of occurrence frequency; multiplying the vector value of each word in the same-subject network document by the vector value of the corresponding word in the current network document, and summing the products as first data; squaring the vector values of the words in the same-subject network document and summing the squares; squaring the vector values of the words in the current network document and summing the squares; multiplying the two sums and taking the square root of the product as second data; and taking the quotient of the first data divided by the second data as the relevance value between the same-subject network document and the current network document.
  4. The method of any one of claims 1 to 3, characterized in that obtaining the weight value of each word in the current network document specifically comprises: counting the occurrence frequency of each word in the current network document, and obtaining the number of indexed documents hit by each word and the total number of indexed documents; and dividing the total number of indexed documents by the number of indexed documents hit by the word, taking the logarithm of the quotient, and multiplying the resulting value by the occurrence frequency to obtain the weight value of the word.
  5. The method of claim 4, characterized in that counting the occurrence frequency of each word in the current network document specifically comprises: obtaining the positions at which the word occurs in the current network document and the number of occurrences at each position; and multiplying the number of occurrences of the word at each position by the coefficient corresponding to that position, and summing the products as the occurrence frequency of the word.
  6. The method of claim 4, characterized in that counting the occurrence frequency of each word in the current network document specifically comprises: counting the number of times the word occurs in the current network document; and determining whether the word occurs at the subject position of the network document and, if so, adding a set value to the total number of occurrences of the word to obtain the occurrence frequency of the word.
  7. A system for aggregating same-subject network documents, characterized by comprising a weight value calculation module, a search term composition module, a network document retrieval module, and an aggregation module, wherein: the weight value calculation module is configured to obtain the weight value of each word in a current network document; the search term composition module is configured to sort the words in descending order of weight value and, starting from the first word, combine each word in turn with the adjacent next word to form a search term; the network document retrieval module is configured to retrieve same-subject network documents with the composed search terms until the number of same-subject network documents retrieved by some search term exceeds a preset value; and the aggregation module is configured to aggregate the current network document and the same-subject network documents.
  8. The system of claim 7, characterized by further comprising a vector value module, a relevance calculation module, and a removal module, wherein: the vector value module is configured to use a hash table to represent the vector values of the words in the current network document and in the same-subject network documents; the relevance calculation module is configured to calculate, from the vector values of the words, the relevance value between each same-subject network document and the current network document; and the removal module is configured to remove the same-subject network documents whose relevance value is below a preset value.
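The weight computation recited in claims 4 and 5 can be sketched as follows. The position names, the coefficient values, the use of the natural logarithm, and the function names are assumptions for illustration; the claims fix neither the logarithm base nor the coefficients.

```python
import math


def weighted_frequency(counts_by_position, coefficients):
    """Claim 5: multiply the occurrence count at each position by that
    position's coefficient and sum the products."""
    return sum(count * coefficients[position]
               for position, count in counts_by_position.items())


def weight_value(frequency, hit_documents, total_documents):
    """Claim 4: log(total indexed documents / indexed documents hit by the
    word), multiplied by the occurrence frequency.
    Assumes hit_documents >= 1 and uses the natural logarithm."""
    return frequency * math.log(total_documents / hit_documents)


if __name__ == "__main__":
    # Hypothetical coefficients: an occurrence in the title counts three times.
    coefficients = {"title": 3.0, "body": 1.0}
    frequency = weighted_frequency({"title": 1, "body": 4}, coefficients)
    print(frequency, weight_value(frequency, hit_documents=10, total_documents=1000))
```

A word that hits every indexed document receives a weight of zero under this formula, so very common words drop to the bottom of the ranking used to compose search terms.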
CN 200810088055 2008-03-27 2008-03-27 Method and system for polymerizing the same subject network document files CN101246501B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200810088055 CN101246501B (en) 2008-03-27 2008-03-27 Method and system for polymerizing the same subject network document files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200810088055 CN101246501B (en) 2008-03-27 2008-03-27 Method and system for polymerizing the same subject network document files

Publications (2)

Publication Number Publication Date
CN101246501A (en) 2008-08-20
CN101246501B (en) 2010-06-23

Family

ID=39946952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200810088055 CN101246501B (en) 2008-03-27 2008-03-27 Method and system for polymerizing the same subject network document files

Country Status (1)

Country Link
CN (1) CN101246501B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799647B (en) * 2012-06-30 2015-01-21 华为技术有限公司 Method and device for webpage reduplication deletion
CN103631791B (en) * 2012-08-22 2017-04-12 腾讯科技(深圳)有限公司 Displaying information collation and the polymerization system
CN103853787B (en) * 2012-12-06 2017-06-16 北大方正集团有限公司 One kind of tracking method similar articles and pictures and systems
CN104123320A (en) * 2013-04-28 2014-10-29 百度在线网络技术(北京)有限公司 Method and device for obtaining related questions corresponding to input question

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1317114A (en) 1998-07-10 2001-10-10 快速检索及传递公司 Search system and method for retrieval of data, and use thereof in search engine

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1317114A (en) 1998-07-10 2001-10-10 快速检索及传递公司 Search system and method for retrieval of data, and use thereof in search engine

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
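Xu Baowen, Zhang Weifeng. 搜索引擎与信息获取技术 (Search Engines and Information Acquisition Technology) 1. Tsinghua University Press, 2003, pp. 12-14, p. 47, pp. 86-90, p. 114.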

Also Published As

Publication number Publication date Type
CN101246501A (en) 2008-08-20 application

Similar Documents

Publication Publication Date Title
US6785688B2 (en) Internet streaming media workflow architecture
US7461064B2 (en) Method for searching documents for ranges of numeric values
Wu et al. Query selection techniques for efficient crawling of structured web sources
US7636714B1 (en) Determining query term synonyms within query context
US6473752B1 (en) Method and system for locating documents based on previously accessed documents
US7231399B1 (en) Ranking documents based on large data sets
US20070022085A1 (en) Techniques for unsupervised web content discovery and automated query generation for crawling the hidden web
US6584468B1 (en) Method and apparatus to retrieve information from a network
US20110246457A1 (en) Ranking of search results based on microblog data
US7580929B2 (en) Phrase-based personalization of searches in an information retrieval system
US6826576B2 (en) Very-large-scale automatic categorizer for web content
US7584175B2 (en) Phrase-based generation of document descriptions
US7599914B2 (en) Phrase-based searching in an information retrieval system
US7536408B2 (en) Phrase-based indexing in an information retrieval system
US20070043723A1 (en) Interactive user-controlled relevanace ranking retrieved information in an information search system
US20070192300A1 (en) Method and system for determining relevant sources, querying and merging results from multiple content sources
US7426507B1 (en) Automatic taxonomy generation in search results using phrases
US20030145001A1 (en) Computerized information search and indexing method, software and device
US7308464B2 (en) Method and system for rule based indexing of multiple data structures
US20120124034A1 (en) Co-selected image classification
US8412699B1 (en) Fresh related search suggestions
US20060200460A1 (en) System and method for ranking search results using file types
US6321228B1 (en) Internet search system for retrieving selected results from a previous search
US20100169331A1 (en) Online relevance engine
US20060149606A1 (en) System and method for agent assisted information retrieval

Legal Events

Date Code Title Description
C06 Publication
C10 Request of examination as to substance
C14 Granted
ASS Succession or assignment of patent right

Owner name: SHENZHEN SHIJI LIGHT SPEED INFORMATION TECHNOLOGY

Free format text: FORMER OWNER: TENGXUN SCI-TECH (SHENZHEN) CO., LTD.

Effective date: 20131021

C41 Transfer of the right of patent application or the patent right
COR Bibliographic change or correction in the description

Free format text: CORRECT: ADDRESS; FROM: 518044 SHENZHEN, GUANGDONG PROVINCE TO: 518057 SHENZHEN, GUANGDONG PROVINCE