CN105260469B - Method, apparatus, and device for processing a sitemap - Google Patents

Method, apparatus, and device for processing a sitemap

Info

Publication number
CN105260469B
Authority
CN
Grant status
Grant
Prior art keywords
site map
access
site
page
search
Application number
CN 201510676894
Other languages
Chinese (zh)
Other versions
CN105260469A (en)
Inventor
梁捷
梁卡喆
Original Assignee
广州神马移动信息科技有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30 Information retrieval; Database structures therefor; File system structures therefor

Abstract

The present invention discloses a method, apparatus, and device for processing a sitemap. The method includes: obtaining a website's sitemap according to preset information; obtaining the links to the pages in the sitemap and accessing them; deleting from the sitemap, according to the access results, the links that adversely affect search indexing; and generating a new sitemap. The technical solution provided by the invention can improve sitemap quality and increase the likelihood of being indexed by search engines, meeting the respective needs of websites and search engines.

Description

Method, apparatus, and device for processing a sitemap

Technical Field

[0001] The present invention relates to the field of mobile Internet technology, and in particular to a method, apparatus, and device for processing a sitemap.

Background

[0002] At present, search engines usually discover web pages through links within a website (also called a site) and on other websites. A sitemap makes it easy for a website to tell search engines which of its pages are available for crawling. In its simplest form, a sitemap is an XML (Extensible Markup Language) file that lists the URLs of the site together with other metadata about each URL (when it was last updated, how often it changes, how important it is relative to other URLs on the site, and so on), so that search engines can crawl the site more intelligently. Put simply, a sitemap can be understood as a list of the links on a website. Generating a sitemap and submitting it to search engines makes the site's content easier to index, including deeply buried pages; it is a good way for a website to communicate with search engines.
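As an illustration of the sitemap format described above, the following Python sketch parses a small XML sitemap into a list of entries. It is not taken from the patent: the sample URLs and the function name `parse_sitemap` are assumptions made for illustration, and the namespace is the one used by the public sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample sitemap with two entries; only <loc> is mandatory.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2015-10-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry, keeping whichever metadata fields are present."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        entry = {}
        for field in ("loc", "lastmod", "changefreq", "priority"):
            value = url.findtext(f"sm:{field}", default=None, namespaces=SITEMAP_NS)
            if value is not None:
                entry[field] = value.strip()
        if "loc" in entry:  # skip malformed entries that have no URL
            entries.append(entry)
    return entries
```

Calling `parse_sitemap(SAMPLE_SITEMAP)` yields two dictionaries; the second contains only the `loc` key because its optional metadata is absent.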

[0003] However, the links in the sitemaps that many websites currently provide can have quality problems, such as broken links, or linked content that is of poor quality or not updated in time. All of these waste the search engine's crawling resources. As a result, even though a website provides a sitemap, the search engine will not necessarily index the sitemap's links based on its crawl results. The links may also trigger the search engine's demotion rules, reducing the number of the site's links that are indexed and lowering the site's search ranking.

[0004] Existing methods of processing sitemaps therefore cannot meet the respective needs of websites and search engines.

Summary of the Invention

[0005] To solve the above technical problem, the present invention provides a method, apparatus, and device for processing a sitemap that can meet the respective needs of websites and search engines.

[0006] According to one aspect of the present invention, a method for processing a sitemap is provided, comprising:

[0007] obtaining a website's sitemap according to preset information;

[0008] obtaining the links to the pages in the sitemap and accessing them;

[0009] deleting from the sitemap, according to the access results, the links that adversely affect search indexing; and

[0010] generating a new sitemap.

[0011] Preferably, after obtaining the links to the pages in the sitemap and accessing them, the method further comprises:

[0012] extracting keywords and a text feature value from each accessed page; and

[0013] deleting from the sitemap the links that adversely affect search indexing, according to the result of comparing the extracted keywords and text feature value with pre-stored keywords and text feature values.

[0014] Preferably, deleting from the sitemap, according to the access results, the links that adversely affect search indexing comprises:

[0015] deleting the corresponding link when the access result is an HTTP 404 (not found) error; or

[0016] deleting the corresponding link when the access result is a page response time greater than or equal to a set threshold; or

[0017] deleting the corresponding link when the access result is that the page's title, keywords, and description are incomplete; or

[0018] deleting the corresponding link when the access result is that the page's body content does not match the page's title, keywords, and description.

[0019] Preferably, deleting from the sitemap the links that adversely affect search indexing, according to the result of comparing the extracted keywords and text feature value with pre-stored keywords and text feature values, comprises:

[0020] when the extracted keywords and text feature value are identical to pre-stored keywords and a pre-stored text feature value, determining that the content is a duplicate submission and deleting the corresponding link.

[0021] Preferably, the method further comprises:

[0022] providing the generated new sitemap for access by a search engine.

[0023] Preferably, the method further comprises:

[0024] recording the indexing data produced when the search engine crawls and indexes the site after accessing the new sitemap.

[0025] According to another aspect of the present invention, an apparatus for processing a sitemap is provided, comprising:

[0026] an obtaining module, configured to obtain a website's sitemap according to preset information;

[0027] an access module, configured to obtain, from the sitemap obtained by the obtaining module, the links to the pages in the sitemap and to access them;

[0028] a first processing module, configured to delete from the sitemap, according to the access results of the access module, the links that adversely affect search indexing; and

[0029] a generating module, configured to generate a new sitemap after the first processing module has performed its processing.

[0030] Preferably, the apparatus further comprises:

[0031] a second processing module, configured to extract keywords and a text feature value from each accessed page and to delete from the sitemap the links that adversely affect search indexing, according to the result of comparing the extracted keywords and text feature value with pre-stored keywords and text feature values;

[0032] wherein the generating module generates the new sitemap after the first processing module and the second processing module have performed their processing.

[0033] Preferably, the apparatus further comprises:

[0034] an output module, configured to provide the new sitemap generated by the generating module for access by a search engine.

[0035] Preferably, the apparatus further comprises:

[0036] a monitoring module, configured to record the indexing data produced when the search engine crawls and indexes the site after accessing the new sitemap.

[0037] Preferably, the first processing module comprises:

[0038] a first deletion unit, configured to delete the corresponding link when the access result is an HTTP 404 (not found) error; or

[0039] a second deletion unit, configured to delete the corresponding link when the access result is a page response time greater than or equal to a set threshold; or

[0040] a third deletion unit, configured to delete the corresponding link when the access result is that the page's title, keywords, and description are incomplete; or

[0041] a fourth deletion unit, configured to delete the corresponding link when the access result is that the page's body content does not match the page's title, keywords, and description.

[0042] According to another aspect of the present invention, a processing device is provided, comprising:

[0043] a memory, configured to store a program; and

[0044] a processor, configured to execute the following program stored in the memory:

[0045] obtaining a website's sitemap according to preset information;

[0046] obtaining the links to the pages in the sitemap and accessing them;

[0047] deleting from the sitemap, according to the access results, the links that adversely affect search indexing; and

[0048] generating a new sitemap.

[0049] As can be seen, in the technical solution of the embodiments of the present invention, the links to the pages in a sitemap are obtained and accessed first; when the access results reveal links that adversely affect search indexing, those links are deleted from the sitemap and a new sitemap is generated. This optimizes the website's original sitemap and, as far as possible, keeps links with poor content or a tendency to fail out of it, which improves sitemap quality, increases the likelihood of being indexed by search engines, and meets the needs of both websites and search engines.

Brief Description of the Drawings

[0050] The above and other objects, features, and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the present disclosure, taken in conjunction with the accompanying drawings, in which like reference numerals generally denote like parts.

[0051] FIG. 1 is a schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention;

[0052] FIG. 2 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention;

[0053] FIG. 3 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention;

[0054] FIG. 4 is a schematic block diagram of an apparatus for processing a sitemap according to the present invention;

[0055] FIG. 5 is another schematic block diagram of an apparatus for processing a sitemap according to the present invention;

[0056] FIG. 6 is a schematic block diagram of a processing device according to the present invention.

Detailed Description

[0057] Preferred embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey its scope to those skilled in the art.

[0058] The present invention provides a method for processing a sitemap that can meet the respective needs of websites and search engines.

[0059] FIG. 1 is a schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention.

[0060] As shown in FIG. 1, the method comprises:

[0061] Step 101: obtain the website's sitemap according to preset information.

[0062] In this step, after an agreement has been reached with the website, the website's sitemap can be obtained according to the configuration information provided by the website.

[0063] Step 102: obtain the links to the pages in the sitemap and access them.

[0064] In this step, each URL (Uniform Resource Locator) link in the sitemap is obtained, and each URL is accessed for verification.
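Step 102 could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `AccessResult`, `http_fetch`, and `access_all` are hypothetical, and the fetch function is injectable so that the loop can be exercised without network access.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessResult:
    url: str
    status: Optional[int]     # HTTP status code, or None if no response was received
    elapsed_seconds: float    # wall-clock time spent fetching the page
    body: str                 # decoded page content ("" on failure)


def http_fetch(url, timeout=10.0):
    """Fetch a URL with urllib and return (status_code, body_text)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for error responses such as 404.
        return err.code, ""
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return None, ""


def access_all(urls, fetch=http_fetch, clock=time.perf_counter):
    """Access every URL once and record status, elapsed time, and body."""
    results = []
    for url in urls:
        start = clock()
        status, body = fetch(url)
        results.append(AccessResult(url, status, clock() - start, body))
    return results
```

Passing a stub such as `fetch=lambda url: (200, "<html></html>")` makes it possible to try the loop offline; with the default `http_fetch`, real requests are sent.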

[0065] Step 103: delete from the sitemap, according to the access results, the links that adversely affect search indexing.

[0066] In this step, deleting from the sitemap, according to the access results, the links that adversely affect search indexing comprises:

[0067] deleting the corresponding link when the access result is an HTTP 404 (not found) error; or

[0068] deleting the corresponding link when the access result is a page response time greater than or equal to a set threshold; or

[0069] deleting the corresponding link when the access result is that the page's title, keywords, and description are incomplete; or

[0070] deleting the corresponding link when the access result is that the page's body content does not match the page's title, keywords, and description.
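The four deletion conditions listed above can be expressed as a single decision function. The sketch below is an illustration only: the `PageCheck` fields, the reason strings, and the 1-second default threshold are assumptions, and the body-versus-TKD comparison is taken as a precomputed boolean rather than implemented here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageCheck:
    status: int               # HTTP status code returned for the link
    response_seconds: float   # measured page response time
    title: str
    keywords: str
    description: str
    body_matches_tkd: bool    # result of a separate body-versus-TKD comparison


def removal_reason(page, max_response_seconds=1.0) -> Optional[str]:
    """Return the first deletion condition the page meets, or None to keep the link."""
    if page.status == 404:
        return "http_404"
    if page.response_seconds >= max_response_seconds:
        return "slow_response"
    if not (page.title.strip() and page.keywords.strip() and page.description.strip()):
        return "incomplete_tkd"
    if not page.body_matches_tkd:
        return "body_tkd_mismatch"
    return None
```

Because the conditions are alternatives, the function stops at the first one that applies; the order used here is a design choice of the sketch.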

[0071] Step 104: generate a new sitemap.

[0072] In this step, after the links that adversely affect search indexing have been deleted from the sitemap, the remaining links are reorganized to generate a new sitemap.
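One possible way to regenerate the sitemap after deletion is sketched below. The function name `build_sitemap` and the dictionary-based entry format are assumptions for illustration; the XML namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries, removed_urls):
    """Serialize the entries whose URL was not removed into sitemap XML.

    `entries` is a list of dicts with a mandatory "loc" key and optional
    "lastmod", "changefreq", and "priority" keys (all string values).
    """
    removed = set(removed_urls)
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_XMLNS})
    for entry in entries:
        if entry["loc"] in removed:
            continue  # drop links that were flagged during checking
        url = ET.SubElement(urlset, "url")
        for field in ("loc", "lastmod", "changefreq", "priority"):
            if field in entry:
                ET.SubElement(url, field).text = entry[field]
    return ET.tostring(urlset, encoding="unicode")
```

The surviving entries keep their original metadata, so only the deleted links differ between the old and new sitemap.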

[0073] As can be seen, in the technical solution of the embodiments of the present invention, the links to the pages in a sitemap are obtained and accessed first; when the access results reveal links that adversely affect search indexing, those links are deleted from the sitemap and a new sitemap is generated. This optimizes the website's original sitemap and, as far as possible, keeps links with poor content or a tendency to fail out of it, which improves sitemap quality, increases the likelihood of being indexed by search engines, and meets the needs of both websites and search engines.

[0074] The technical solution of the present invention is described more specifically below.

[0075] FIG. 2 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention.

[0076] As shown in FIG. 2, the method comprises:

[0077] Step 201: obtain the website's sitemap according to preset information.

[0078] For this step, see the description of step 101 above.

[0079] Step 202: obtain the links to the pages in the sitemap and access them.

[0080] For this step, see the description of step 102 above.

[0081] Step 203: delete from the sitemap, according to the access results, the links that adversely affect search indexing.

[0082] For this step, see the description of step 103 above.

[0083] Step 204: extract keywords and a text feature value from each accessed page.

[0084] In this step, various existing algorithms may be used to extract keywords from the page content and to extract a text feature value from the body content; the present invention does not limit the choice.

[0085] Step 205: delete from the sitemap the links that adversely affect search indexing, according to the result of comparing the extracted keywords and text feature value with pre-stored keywords and text feature values.

[0086] In this step, when the extracted keywords and text feature value are identical to pre-stored keywords and a pre-stored text feature value, the content is determined to be a duplicate submission and the corresponding link is deleted.

[0087] Step 206: generate a new sitemap.

[0088] Step 207: provide the generated new sitemap for access by a search engine.

[0089] In this step, the generated new sitemap may replace the website's original sitemap, so that search engines access the new sitemap on the website. Alternatively, the website may be configured so that search engines access the new sitemap directly on the service platform. The present invention does not limit this, as long as search engines can access the new sitemap.

[0090] It should be noted that the processing of steps 202 and 203 and the processing of steps 204 and 205 need not occur in any particular order relative to each other; the arrangement of steps above is only for convenience of description.

[0091] It should also be noted that, after step 207, the method may further comprise: recording the indexing data produced when the search engine crawls and indexes the site after accessing the new sitemap.

[0092] As can be seen, the technical solution of this embodiment can delete links that adversely affect search indexing both according to the access results and according to the result of comparing the extracted keywords and text feature values with pre-stored ones, which improves the optimization. In addition, the indexing data produced when the search engine crawls and indexes the site after accessing the new sitemap can be recorded, providing a reference for later sitemap modifications or for analysis by the website.

[0093] FIG. 3 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention.

[0094] As shown in FIG. 3, the method comprises:

[0095] Step 301: the sitemap service platform extracts data from the website's sitemap according to the website's configuration information.

[0096] In this step, the website and the sitemap service platform (hereinafter "the service platform") reach an agreement in advance. The website sets up a mapping between its sitemap and the service platform, allowing the service platform to process the sitemap according to the configuration information the website provides, such as address information. The website can set up the mapping in XML. Based on that configuration information, the service platform can extract data from the sitemap and obtain the URL of each link in it.

[0097] Step 302: the service platform checks each URL extracted from the sitemap and determines whether accessing the URL produces an HTTP 404 (not found) error. If so, go to step 311, delete the URL from the sitemap, and record the reason; if not, go to step 303.

[0098] An HTTP 404 error means that the page the link points to does not exist, that is, the original page's URL is no longer valid. This happens often, for example when the rules for generating page URLs change, a page file is renamed or moved, or an inbound link is misspelled, so that the original URL can no longer be reached. When a web server receives such a request, it returns a 404 status code, telling the browser that the requested resource does not exist. Therefore, when accessing a URL produces an HTTP 404 error, the URL is considered invalid; it is deleted from the sitemap and the reason is recorded.

[0099] Step 303: the service platform determines whether the response speed of the page at the URL is abnormal. If so, go to step 311, delete the URL from the sitemap, and record the reason; if not, go to step 304.

[0100] When a URL can be accessed normally, the page's response speed is checked; response speed can be measured by response time. If the response time is greater than or equal to a set threshold, the response speed is considered abnormal; if it is below the threshold, the response speed is considered normal. The threshold can be chosen empirically, for example 500 milliseconds or 1 second; the present invention does not limit it.

[0101] It should be noted that the response speed can also be judged by comparing the page's historical response speed with its current response speed. If the current response time is much longer than the historical response time and exceeds a certain threshold, the response speed can be considered abnormal.

[0102] Therefore, an abnormal page response speed indicates that there may be a problem with the page that the URL points to or with the network connection for that URL, either of which affects the user's browsing experience. In that case the URL is deleted from the sitemap and the reason is recorded.
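The two ways of judging response speed described in step 303 might look like the following in Python. The function names, the 0.5-second default (one of the example values mentioned above), and the use of a ratio against the historical mean are all assumptions of this sketch; the patent leaves the exact comparison open.

```python
from statistics import mean


def exceeds_fixed_threshold(current_seconds, threshold_seconds=0.5):
    """Abnormal when the response time is greater than or equal to a fixed threshold."""
    return current_seconds >= threshold_seconds


def exceeds_history(current_seconds, history_seconds, max_ratio=3.0):
    """Abnormal when the response time is far above the page's historical mean.

    `max_ratio` is a hypothetical tuning knob: the current time must be more
    than `max_ratio` times the historical mean to count as abnormal.
    """
    if not history_seconds:
        return False  # no history yet, so nothing to compare against
    return current_seconds > max_ratio * mean(history_seconds)
```

A fixed threshold is simple to reason about, while the historical comparison adapts to pages that are always somewhat slow; the two checks can also be combined.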

[0103] Step 304: the service platform determines whether the page's TKD is incomplete. If so, go to step 311, delete the URL from the sitemap, and record the reason; if not, go to step 305.

[0104] TKD is an abbreviation of title, keywords, and description. TKD content may be formatted as follows:

[0105] <title>title content goes here</title>

[0106] <meta name="keywords" content="keyword content goes here"/>

[0107] <meta name="description" content="description content goes here"/>

[0108] Keywords are the terms that a site administrator sets for a page so that users can find the page through a search engine; they represent the site's market positioning. The description, also called the "content tag", "description tag", or "content summary", reflects the main content of the page.

[0109] In general, only a complete TKD conforms to search engines' crawling rules. If the TKD is incomplete and does not conform to those rules, the search engine may not crawl the page or may not index the linked content. Therefore, when the TKD is found to be incomplete, the URL is deleted from the sitemap and the reason is recorded.
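A completeness check of the TKD fields described in step 304 could be sketched with Python's standard `html.parser` module. The class and function names are hypothetical, and "complete" is interpreted here as all three fields being present and non-empty, which is an assumption about the patent's intent.

```python
from html.parser import HTMLParser


class TKDParser(HTMLParser):
    """Collect the <title> text and the keywords/description <meta> contents."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            content = (attr.get("content") or "").strip()
            if name == "keywords":
                self.keywords = content
            elif name == "description":
                self.description = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_tkd(html):
    """Return the page's title, keywords, and description as a dict of strings."""
    parser = TKDParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "keywords": parser.keywords,
        "description": parser.description,
    }


def tkd_is_complete(tkd):
    """True only when title, keywords, and description are all non-empty."""
    return all(tkd[field] for field in ("title", "keywords", "description"))
```

A page that has a title but no `keywords` or `description` meta tag would be reported as incomplete and, following step 304, its URL would be a candidate for deletion.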

[0110] Step 305: the service platform determines whether the page's body content fails to match the TKD. If so, go to step 311, delete the URL from the sitemap, and record the reason; if not, go to step 306.

[0111] In this step, the page's body content is examined to determine whether the keywords in the TKD appear in the body and whether the body content corresponds to the TKD's title and description. If the TKD keywords appear in the body and the body content corresponds to the TKD's title and description, the two match; otherwise they do not. A mismatch suggests that either the body or the TKD was set incorrectly, either of which degrades the search engine's result quality and the user's browsing experience. Therefore, when the page's body content is found not to match the TKD, the URL is deleted from the sitemap and the reason is recorded.
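The patent does not specify how "corresponds to" is measured in step 305, so the following Python sketch uses one simple interpretation: every TKD keyword must occur in the body, and at least half of the words in the title and description must also occur there. The function names and the `min_overlap` value are assumptions, and the word splitting suits space-delimited text only; Chinese pages would need a word segmenter.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def body_matches_tkd(body_text, title, keywords, description, min_overlap=0.5):
    """Heuristic match between a page body and its title/keywords/description."""
    body_lower = body_text.lower()
    # Keywords are assumed to be comma-separated (ASCII or full-width comma).
    keyword_list = [k.strip().lower() for k in re.split(r"[,，]", keywords) if k.strip()]
    if not keyword_list or not all(k in body_lower for k in keyword_list):
        return False
    summary_tokens = tokenize(title + " " + description)
    if not summary_tokens:
        return False
    overlap = len(summary_tokens & tokenize(body_text)) / len(summary_tokens)
    return overlap >= min_overlap
```

Requiring all keywords rather than any single keyword is the stricter reading; relaxing it to `any(...)` would be an equally defensible choice.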

[0112] Step 306: the service platform extracts keywords from the page content and extracts a text feature value from the body content.

[0113] In this step, the service platform may use various existing algorithms to extract keywords from the page content and to extract a text feature value from the body content; the present invention does not limit the choice.

[0114] For example, keyword extraction may use the existing TF-IDF (term frequency-inverse document frequency) algorithm, which essentially stores the information for all terms in a dictionary, sorts the dictionary by value, and takes the few terms with the highest weights as keywords. As another example, the text feature value may be extracted from the body content using a text feature extraction method based on a context framework or one based on ontology.
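The dictionary-and-sort flavor of TF-IDF described above can be sketched as follows. This is a minimal illustration: the function names are hypothetical, the tokenizer handles only lowercase ASCII words, and the smoothed inverse document frequency is one common variant chosen for the sketch, not a formula given in the patent.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_keywords(document, corpus, top_k=3):
    """Return the top_k terms of `document` ranked by TF-IDF weight over `corpus`."""
    doc_tokens = tokenize(document)
    if not doc_tokens:
        return []
    term_counts = Counter(doc_tokens)
    corpus_token_sets = [set(tokenize(text)) for text in corpus]
    n_docs = len(corpus_token_sets)

    weights = {}  # dictionary holding the weight of every term in the document
    for term, count in term_counts.items():
        tf = count / len(doc_tokens)
        doc_freq = sum(1 for tokens in corpus_token_sets if term in tokens)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0  # smoothed IDF
        weights[term] = tf * idf

    # Sort by weight (highest first); ties are broken alphabetically.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_k]]
```

Terms that are frequent in one page but rare across the corpus rise to the top, which is why common words contribute little even when they occur often.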

[0115] Step 307: the service platform compares the extracted keywords and text feature value with the keywords and text feature values stored on the service platform to check whether the content has been submitted repeatedly. If so, go to step 311, delete the URL from the sitemap, and record the reason; if not, go to step 308.

[0116] This step compares the extracted keywords and text feature value with those stored on the service platform in order to match on body-text similarity. If the same keywords and text feature value are found on the service platform, the content is judged to be a duplicate. This matching check makes it possible to detect repeated submissions of the same content. The service platform stores in advance the keywords and text feature values of every page article it has already checked.

[0117] Step 308: the service platform saves the extracted keywords, the text feature value, and the corresponding link for use in later duplicate checks.
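Steps 307 and 308 together amount to a lookup-then-store operation, sketched below in Python. The class name `DuplicateStore` is hypothetical, and using a SHA-256 hash of whitespace- and case-normalized text as the "text feature value" is a stand-in assumption: it only detects near-exact copies, whereas the context-framework or ontology-based features mentioned in the patent would be more tolerant of small edits.

```python
import hashlib
import re


def text_fingerprint(text):
    """Hash of the body text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DuplicateStore:
    """In-memory store of (keywords, fingerprint) signatures for checked pages."""

    def __init__(self):
        self._seen = {}  # signature -> URL that first carried the content

    def check_and_store(self, url, keywords, body_text):
        """Return the URL already stored with the same signature, or None if the page is new."""
        signature = (tuple(sorted(keywords)), text_fingerprint(body_text))
        existing = self._seen.get(signature)
        if existing is not None and existing != url:
            return existing          # step 307: duplicate content found
        self._seen[signature] = url  # step 308: save for later duplicate checks
        return None
```

Returning the earlier URL, rather than a bare boolean, makes it easy to record which page the duplicate collided with.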

[0118] Step 309: the service platform generates the processed, new sitemap data for search engines to retrieve.

[0119] In this step, the website may be configured to direct search engines to retrieve the sitemap directly from the service platform, or the service platform may directly replace the website's original sitemap with the new one.

[0120] Step 310: the service platform monitors how the latest sitemap data is indexed.

[0121] If a URL in the sitemap is indexed by the search engine, marker information is returned. By monitoring which URLs the search engine has indexed, the service platform can provide a reference for later adjustments to the sitemap.

[0122] Step 311: the service platform deletes the link from the sitemap and records the reason for analysis by the website.

[0123] In this step, the reason the link was deleted can be recorded in detail for analysis by the website.
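The record kept in step 311 could take many forms; the following Python sketch shows one possibility, an in-memory log that can be summarized or exported as CSV for the website to analyze. The class name, field names, and reason strings are all hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timezone


class RemovalLog:
    """Collects one record per link deleted from the sitemap."""

    FIELDS = ["url", "reason", "detail", "removed_at"]

    def __init__(self):
        self.records = []

    def record(self, url, reason, detail=""):
        """Store why a link was deleted, with a UTC timestamp."""
        self.records.append({
            "url": url,
            "reason": reason,
            "detail": detail,
            "removed_at": datetime.now(timezone.utc).isoformat(),
        })

    def counts_by_reason(self):
        """Number of deleted links per reason, for a quick overview."""
        return Counter(r["reason"] for r in self.records)

    def to_csv(self):
        """Export all records as CSV text with a header row."""
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=self.FIELDS)
        writer.writeheader()
        writer.writerows(self.records)
        return buffer.getvalue()
```

Grouping deletions by reason lets a site owner see at a glance whether most problems come from dead links, slow pages, or metadata issues.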

[0124] 可以发现,本发明实施例的技术方案,对获取的网站的sitemap数据进行分析过滤,并对sitemap提供的链接进行访问验证,另外还对正文内容进行关键词提取和正文特征值提取,并与预存的关键词和正文特征值进行匹配,从而避免提交重复内容或者质量不好的内容。 [0124] can be found in the technical solutions of the embodiments of the present invention, the data acquisition site for sitemap analyze filter, and a link provided by the sitemap access authentication, in addition to the text content of a keyword and the extracted text feature extraction, and matched with the prestored keywords and text feature value, so as to avoid duplicate content submission or of poor quality. 最后还可以对搜索引擎对sitemap的收录情况进行监控。 Finally, the search engine can also monitor the situation included the sitemap. 通过上述处理,本发明可以优化sitemap的质量,提升网站内容被搜索引擎收录的收录数量,让搜索引擎更好的收录网站的页面,也解决了重复内容、垃圾内容提交到搜索引擎导致的搜索降权的问题,还可以更好的监控网站内容的情况。 Through the above process, the present invention can optimize the quality sitemap, enhance the number of indexed web content indexed by search engines to make the page better search engine included in the site, but also solve the duplicate content, spam submitted to the search engines due to lower search question right, you can also monitor the situation better website content.

[0125] The method of processing a sitemap according to the present invention has been described in detail above. Correspondingly, the present invention also provides an apparatus for processing a sitemap.

[0126] FIG. 4 is a schematic block diagram of an apparatus for processing a sitemap according to the present invention.

[0127] As shown in FIG. 4, an apparatus for processing a sitemap includes an acquisition module 401, an access module 402, a first processing module 403, and a generation module 404. The apparatus may be a service platform or another device.

[0128] The acquisition module 401 is configured to acquire the sitemap of a website according to preset information.

[0129] After reaching agreement with the website, the apparatus may acquire the website's sitemap through the acquisition module 401 according to setting information provided by the website.

[0130] The access module 402 is configured to obtain the page links in the sitemap acquired by the acquisition module 401 and to access them.

[0131] The access module 402 obtains each URL link in the sitemap and accesses each URL link for verification.
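The link-extraction half of the access module might look like the sketch below. It assumes the sitemap is an XML document in the sitemaps.org 0.9 format, which the patent does not state explicitly; `extract_links` is a hypothetical helper name, and the actual page access is left out so the example needs no network.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps.org XML sitemaps (an assumption about the input format).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def extract_links(sitemap_xml: str) -> list:
    """Return the URL of every <loc> entry in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    links = []
    for loc in root.iter(f"{{{SITEMAP_NS}}}loc"):
        # Skip empty entries and trim surrounding whitespace.
        if loc.text and loc.text.strip():
            links.append(loc.text.strip())
    return links
```

Each returned URL would then be requested in turn, with the status code, response time, and page content recorded for the processing modules.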

[0132] The first processing module 403 is configured to delete from the sitemap, according to the access results of the access module 402, links that affect search indexing.

[0133] The first processing module 403 deletes links that affect search indexing from the sitemap according to the various access results.

[0134] The generation module 404 is configured to generate a new sitemap after the first processing module 403 has performed its processing.
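A minimal sketch of the generation step, assuming the new sitemap is written in the sitemaps.org XML format and contains only the URLs that survived filtering. The function name `build_sitemap` is hypothetical, and optional fields such as `lastmod` are omitted.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemaps.org XML sitemaps (an assumption about the output format).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls) -> str:
    """Serialize the retained URLs as a sitemap XML string."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as '&' when serializing text.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

The resulting string could either overwrite the website's original sitemap or be served from the service platform, matching the two delivery options the description allows.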

[0135] FIG. 5 is another schematic block diagram of an apparatus for processing a sitemap according to the present invention.

[0136] As shown in FIG. 5, an apparatus for processing a sitemap includes an acquisition module 401, an access module 402, a first processing module 403, and a generation module 404. For the functions of these modules, see the description of FIG. 4.

[0137] In addition, the apparatus further includes a second processing module 405.

[0138] The second processing module 405 is configured to extract keywords and body-text feature values from the accessed pages and, according to the result of comparing the extracted keywords and body-text feature values with pre-stored keywords and body-text feature values, to delete from the sitemap links that affect search indexing. The generation module 404 generates the new sitemap after the first processing module 403 and the second processing module 405 have performed their processing.

[0139] When the comparison shows that the extracted keywords and body-text feature values are identical to the pre-stored keywords and body-text feature values, the second processing module 405 determines that the content is a duplicate submission and deletes the corresponding link.
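The duplicate check can be illustrated with the sketch below. The patent does not say how the body-text feature value is computed; here it is assumed to be a SHA-256 hash of the whitespace-normalized, lowercased body, which detects only exact duplicates after normalization. `text_feature` and `DuplicateFilter` are hypothetical names.

```python
import hashlib
import re


def text_feature(body: str) -> str:
    """Assumed feature value: hash of the normalized body text."""
    normalized = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DuplicateFilter:
    """Stores (keywords, feature value) signatures of pages already seen."""

    def __init__(self):
        self._seen = set()

    def is_duplicate(self, keywords, body) -> bool:
        # Keyword order and letter case are ignored when comparing.
        signature = (
            frozenset(k.strip().lower() for k in keywords),
            text_feature(body),
        )
        if signature in self._seen:
            return True
        # First occurrence: store the signature for later comparisons.
        self._seen.add(signature)
        return False
```

A link whose page returns `True` here would be treated as a duplicate submission and dropped from the sitemap. A similarity-based fingerprint could replace the exact hash if near-duplicates also need to be caught.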

[0140] The apparatus further includes an output module 406.

[0141] The output module 406 is configured to make the new sitemap generated by the generation module available for the search engine to access.

[0142] The present invention may replace the website's original sitemap with the generated new sitemap, so that the search engine accesses the new sitemap at the website. Alternatively, the website may be configured so that the search engine accesses the new sitemap directly at the service platform. The present invention imposes no limitation in this respect, provided that the search engine can access the new sitemap.

[0143] The apparatus further includes a monitoring module 407.

[0144] The monitoring module 407 is configured to record the indexing data produced when the search engine crawls and indexes pages after accessing the new sitemap.

[0145] The first processing module 403 includes a first deletion unit 4031, a second deletion unit 4032, a third deletion unit 4033, or a fourth deletion unit 4034.

[0146] The first deletion unit 4031 is configured to delete the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed.

[0147] The second deletion unit 4032 is configured to delete the corresponding link when the access result is that the page response time is greater than or equal to a set threshold.

[0148] The third deletion unit 4033 is configured to delete the corresponding link when the access result is that the page's title, keywords, and description are incomplete.

[0149] The fourth deletion unit 4034 is configured to delete the corresponding link when the access result is that the body content of the page does not match the page's title, keywords, and description.
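The four deletion units can be read as four checks applied to each access result, as in the sketch below. Several details are assumptions rather than part of the patent: the `AccessResult` fields, the default threshold of 3.0 seconds, and the mismatch test (no keyword occurs in the body text) are all illustrative choices. Returning the reason as a string mirrors the idea of recording why a link was deleted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessResult:
    """Hypothetical record of one page access."""
    status: int
    response_seconds: float
    title: str = ""
    keywords: tuple = ()
    description: str = ""
    body: str = ""


def deletion_reason(result: AccessResult, max_seconds: float = 3.0) -> Optional[str]:
    """Return why the link should be deleted, or None to keep it."""
    # Unit 4031: the page cannot be accessed.
    if result.status == 404:
        return "HTTP 404"
    # Unit 4032: response time is greater than or equal to the threshold.
    if result.response_seconds >= max_seconds:
        return "slow response"
    # Unit 4033: title, keywords, or description is missing.
    if not (result.title.strip() and result.keywords and result.description.strip()):
        return "incomplete metadata"
    # Unit 4034: assumed mismatch test, no keyword occurs in the body.
    body = result.body.lower()
    if not any(k.lower() in body for k in result.keywords):
        return "metadata does not match body"
    return None
```

For example, `deletion_reason(AccessResult(404, 0.1))` yields the 404 reason, while a fast page with complete metadata and a matching body yields `None` and stays in the sitemap.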

[0150] The present invention also provides a processing device.

[0151] FIG. 6 is a schematic block diagram of a processing device according to the present invention.

[0152] As shown in FIG. 6, the processing device includes a memory 601 and a processor 602.

[0153] The memory 601 is configured to store a program.

[0154] The processor 602 is configured to execute the following program stored in the memory 601:

[0155] acquiring the sitemap of a website according to preset information;

[0156] obtaining the page links in the sitemap and accessing them;

[0157] deleting from the sitemap, according to the access results, links that affect search indexing;

[0158] generating a new sitemap.

[0159] It should be noted that the processor 602 is also configured to execute other programs stored in the memory 601. For details of those programs, see the description of the method flow above, which is not repeated here.

[0160] In summary, the technical solution of the embodiments of the present invention analyzes and filters the acquired sitemap data of the website and verifies the links provided by the sitemap by accessing them. It also extracts keywords and body-text feature values from the body content and matches them against pre-stored keywords and body-text feature values, thereby avoiding the submission of duplicate or low-quality content. Finally, the search engine's indexing of the sitemap can be monitored. Through this processing, the present invention can improve sitemap quality, increase the number of website pages indexed by the search engine, and allow the search engine to index the website's pages more effectively. It also solves the problem of search ranking demotion caused by submitting duplicate or spam content to the search engine, and enables better monitoring of the website's content.

[0161] The technical solution according to the present invention has been described in detail above with reference to the accompanying drawings.

[0162] Furthermore, the method according to the present invention may also be implemented as a computer program comprising computer program code instructions for performing the steps defined in the above method of the present invention. Alternatively, the method may be implemented as a computer program product comprising a computer-readable medium on which is stored a computer program for performing the functions defined in the above method. Those skilled in the art will also appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.

[0163] The flowcharts and block diagrams in the accompanying drawings show the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

[0164] Embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

  1. A method for a service platform to process a sitemap, wherein the service platform negotiates with a website in advance and the website sets a mapping relationship between the sitemap and the service platform, so as to allow the service platform to process the website's sitemap according to setting information provided by the website, the method comprising: acquiring the sitemap of the website according to preset information; obtaining the page links in the sitemap and accessing them; deleting from the sitemap, according to the access results, links that affect search indexing; and generating a new sitemap and replacing the original sitemap on the website with the new sitemap, wherein deleting from the sitemap, according to the access results, links that affect search indexing comprises: deleting the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed; or deleting the corresponding link when the access result is that the page response time is greater than or equal to a set threshold; or deleting the corresponding link when the access result is that the page's title, keywords, and description are incomplete; or deleting the corresponding link when the access result is that the body content of the page does not match the page's title, keywords, and description.
  2. The method according to claim 1, characterized in that, after obtaining the page links in the sitemap and accessing them, the method further comprises: extracting keywords and body-text feature values from the accessed pages; and deleting from the sitemap links that affect search indexing, according to the result of comparing the extracted keywords and body-text feature values with pre-stored keywords and body-text feature values.
  3. The method according to claim 2, characterized in that deleting from the sitemap links that affect search indexing, according to the result of comparing the extracted keywords and body-text feature values with the pre-stored keywords and body-text feature values, comprises: when the comparison shows that the extracted keywords and body-text feature values are identical to the pre-stored keywords and body-text feature values, determining that the content is a duplicate submission and deleting the corresponding link.
  4. The method according to any one of claims 1 to 3, characterized in that the method further comprises: making the generated new sitemap available for a search engine to access; or configuring the website to direct the search engine to fetch the generated new sitemap directly from the service platform.
  5. The method according to claim 4, characterized in that the method further comprises: recording the indexing data produced when the search engine crawls and indexes pages after accessing the new sitemap.
  6. A service platform apparatus for processing a sitemap, wherein the service platform apparatus negotiates with a website in advance and the website sets a mapping relationship between the sitemap and the service platform, so as to allow the service platform apparatus to process the sitemap according to setting information provided by the website, the apparatus comprising: an acquisition module configured to acquire the sitemap of the website according to preset information; an access module configured to obtain the page links in the sitemap acquired by the acquisition module and to access them; a first processing module configured to delete from the sitemap, according to the access results of the access module, links that affect search indexing; and a generation module configured to generate a new sitemap after the first processing module has performed its processing and to replace the original sitemap on the website with the new sitemap, wherein the first processing module comprises: a first deletion unit configured to delete the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed; or a second deletion unit configured to delete the corresponding link when the access result is that the page response time is greater than or equal to a set threshold; or a third deletion unit configured to delete the corresponding link when the access result is that the page's title, keywords, and description are incomplete; or a fourth deletion unit configured to delete the corresponding link when the access result is that the body content of the page does not match the page's title, keywords, and description.
  7. The apparatus according to claim 6, characterized in that the apparatus further comprises: a second processing module configured to extract keywords and body-text feature values from the accessed pages and, according to the result of comparing the extracted keywords and body-text feature values with pre-stored keywords and body-text feature values, to delete from the sitemap links that affect search indexing; wherein the generation module generates the new sitemap after the first processing module and the second processing module have performed their processing.
  8. The apparatus according to claim 6, characterized in that the apparatus further comprises: an output module configured to make the new sitemap generated by the generation module available for a search engine to access, or to configure the website to direct the search engine to fetch the generated new sitemap directly from the service platform.
  9. The apparatus according to claim 8, characterized in that the apparatus further comprises: a monitoring module configured to record the indexing data produced when the search engine crawls and indexes pages after accessing the new sitemap.
  10. A processing device, characterized by comprising: a memory configured to store a program; and a processor configured to execute the program, stored in the memory, of the method according to claim 1.
CN 201510676894 2015-10-16 2015-10-16 A method of processing site map, installations and equipment CN105260469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201510676894 CN105260469B (en) 2015-10-16 2015-10-16 A method of processing site map, installations and equipment

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN 201510676894 CN105260469B (en) 2015-10-16 2015-10-16 A method of processing site map, installations and equipment
PCT/CN2016/102215 WO2017063596A1 (en) 2015-10-16 2016-10-14 Method, apparatus and device for processing sitemap

Publications (2)

Publication Number Publication Date
CN105260469A true CN105260469A (en) 2016-01-20
CN105260469B true CN105260469B (en) 2017-12-26

Family

ID=55100159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201510676894 CN105260469B (en) 2015-10-16 2015-10-16 A method of processing site map, installations and equipment

Country Status (2)

Country Link
CN (1) CN105260469B (en)
WO (1) WO2017063596A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105260469B (en) * 2015-10-16 2017-12-26 广州神马移动信息科技有限公司 A method of processing site map, installations and equipment
CN106095674A (en) * 2016-06-07 2016-11-09 百度在线网络技术(北京)有限公司 Website automated testing method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1486457A (en) * 2000-11-21 2004-03-31 汤姆森许可公司 System and process for mediated crawling
CN102057372A (en) * 2008-04-17 2011-05-11 谷歌公司 Generating sitemaps
CN104317938A (en) * 2014-10-31 2015-01-28 北京国双科技有限公司 Webpage validation method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7769742B1 (en) * 2005-05-31 2010-08-03 Google Inc. Web crawler scheduler that utilizes sitemaps from websites
US8126869B2 (en) * 2008-02-08 2012-02-28 Microsoft Corporation Automated client sitemap generation
US7865497B1 (en) * 2008-02-21 2011-01-04 Google Inc. Sitemap generation where last modified time is not available to a network crawler
CN105260469B (en) * 2015-10-16 2017-12-26 广州神马移动信息科技有限公司 A method of processing site map, installations and equipment

Also Published As

Publication number Publication date Type
WO2017063596A1 (en) 2017-04-20 application
CN105260469A (en) 2016-01-20 application

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination