WO2017063596A1 - Method, apparatus and device for processing sitemap - Google Patents

Method, apparatus and device for processing sitemap

Info

Publication number
WO2017063596A1
WO2017063596A1 PCT/CN2016/102215 CN2016102215W WO2017063596A1 WO 2017063596 A1 WO2017063596 A1 WO 2017063596A1 CN 2016102215 W CN2016102215 W CN 2016102215W WO 2017063596 A1 WO2017063596 A1 WO 2017063596A1
Authority
WO
WIPO (PCT)
Prior art keywords
website
link
map
access
page
Prior art date
Application number
PCT/CN2016/102215
Other languages
French (fr)
Chinese (zh)
Inventor
梁捷
梁卡喆
Original Assignee
广州神马移动信息科技有限公司
Priority date
Filing date
Publication date
Application filed by 广州神马移动信息科技有限公司
Publication of WO2017063596A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Definitions

  • The present invention relates to the field of mobile internet technologies, and in particular to a method, apparatus and device for processing a website map (sitemap).
  • Sitemaps can be used to make it easier for sites to tell search engines what pages are available on the site.
  • The simplest form of sitemap is an XML (Extensible Markup Language) file that lists the URLs of the site together with other metadata about each URL (the time of the last update, the frequency of changes, its importance relative to other URLs on the site, etc.), so that search engines can crawl the site's content more intelligently.
  • A sitemap can be understood as a list of the links on a website. Generating a sitemap and submitting it to a search engine makes the site's content, including content on deeper pages, easier for the search engine to include; it is a good way for a site to communicate with a search engine.
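  • As an illustration of the XML form described above, the following minimal sketch parses a sitemap that follows the public sitemaps.org 0.9 schema. The sample document, the example.com URLs and the function name parse_sitemap are assumptions made for illustration; the application itself does not prescribe a parser.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org 0.9 protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample sitemap; the example.com URLs are placeholders.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2016-10-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/news/42</loc>
  </url>
</urlset>
"""


def _text(url_elem, tag):
    """Return the stripped text of a child element, or None if it is absent."""
    value = url_elem.findtext(f"sm:{tag}", namespaces=NS)
    return value.strip() if value else None


def parse_sitemap(xml_text):
    """Return one dict per <url> entry; missing optional fields are None."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        {
            "loc": _text(url, "loc"),
            "lastmod": _text(url, "lastmod"),
            "changefreq": _text(url, "changefreq"),
            "priority": _text(url, "priority"),
        }
        for url in root.findall("sm:url", NS)
    ]


entries = parse_sitemap(SAMPLE_SITEMAP)
```

  • Each entry keeps the link (loc) alongside its optional metadata, so later checks can remove an entry without losing the metadata of the entries that remain.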
  • However, the quality of the website links contained in the sitemaps provided by many websites is low. There may be many problems, such as erroneous links and inconsistent or untimely updates of links, which waste the search engine's crawling resources.
  • In view of this, the present invention provides a method, apparatus and device for processing a website map, to improve the quality of the website links in the website map.
  • A method for processing a website map includes: obtaining a website map of a website according to preset information; obtaining the links of the pages in the website map and accessing them; deleting, according to the access result, links in the website map that affect search inclusion; and generating a new website map.
  • The method further includes: extracting keywords and a text feature value from the accessed page; and deleting links in the website map that affect search inclusion according to the result of comparing the extracted keywords and text feature value with pre-stored keywords and text feature values.
  • Deleting the links that affect search inclusion according to the access result comprises: deleting the corresponding link when the access result is an HTTP 404 (not found) error; or deleting the corresponding link when the access result is a page response time greater than or equal to a set threshold; or deleting the corresponding link when the access result is that the title, keywords and description of the page are incomplete; or deleting the corresponding link when the access result is that the body content of the page does not match the title, keywords and description of the page.
  • Deleting the links that affect search inclusion according to the comparison result comprises: when the extracted keywords and text feature value are consistent with the pre-stored keywords and text feature values, determining that the content has been submitted repeatedly and deleting the corresponding link.
  • the method further comprises: providing the generated new website map to the search engine for access.
  • The method further includes: recording data on what the search engine includes after it accesses the new website map.
  • An apparatus for processing a website map includes: an obtaining module, configured to obtain a website map of a website according to preset information; an access module, configured to obtain the links of the pages in the website map obtained by the obtaining module and access them; a first processing module, configured to delete links in the website map that affect search inclusion according to the access result of the access module; and a generating module, configured to generate a new website map after processing by the first processing module.
  • The apparatus further includes: a second processing module, configured to extract keywords and a text feature value from the accessed page, and to delete links in the website map that affect search inclusion according to the result of comparing the extracted keywords and text feature value with pre-stored keywords and text feature values; the generating module is configured to generate the new website map after processing by the first processing module and the second processing module.
  • the device further comprises: an output module, configured to provide the new website map generated by the generating module to the search engine for access.
  • The apparatus further includes: a monitoring module, configured to record data on what the search engine includes after it accesses the new website map.
  • The first processing module includes: a first deleting unit, configured to delete the corresponding link when the access result is an HTTP 404 (not found) error; or a second deleting unit, configured to delete the corresponding link when the access result is a page response time greater than or equal to the set threshold; or a third deleting unit, configured to delete the corresponding link when the access result is that the title, keywords and description of the page are incomplete; or a fourth deleting unit, configured to delete the corresponding link when the access result is that the body content of the page does not match the title, keywords and description of the page.
  • A processing device comprises: a memory for storing a program; and a processor configured to execute, by calling the program stored in the memory, a method comprising the following steps: obtaining a website map of a website according to preset information; obtaining the links of the pages in the website map and accessing them; deleting links in the website map that affect search inclusion according to the access result; and generating a new website map.
  • a computer readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the method described above.
  • The technical solution of the embodiments of the present invention accesses the links of the pages in the website map and, when the access result shows that a link affects search inclusion, deletes that link from the website map and then generates a new website map. The original website map of the website is thereby optimized so that it avoids, as far as possible, low-quality or erroneous links. This improves the quality of the website map and increases the likelihood that its links are included by the search engine, meeting the respective needs of the website and the search engine.
  • FIG. 1 is a schematic flow chart of a method of processing a website map according to an embodiment of the present application.
  • FIG. 2 is a schematic flow chart of a method of processing a website map according to another embodiment of the present application.
  • FIG. 3 is a schematic flow chart of a method of processing a website map according to still another embodiment of the present application.
  • FIG. 4 is a schematic block diagram of an apparatus for processing a website map according to an embodiment of the present application.
  • FIG. 5 is a schematic block diagram of an apparatus for processing a website map according to another embodiment of the present application.
  • FIG. 6 is a schematic block diagram of a processing device in accordance with still another embodiment of the present application.
  • The present application provides a method for processing a website map, which can improve the quality of the website links in the website map and meet the respective needs of the website and the search engine.
  • FIG. 1 is a schematic flow chart of a method of processing a website map in accordance with one embodiment of the present application.
  • Step S101 Obtain a website map of the website according to the preset information.
  • the website map of the website may be obtained according to the setting information provided by the website after being negotiated with the website.
  • Step S102 Obtain a link of a page in the website map and perform access.
  • A link is obtained for each page in the website map. Each link is a URL (Uniform Resource Locator), and each URL is accessed separately for verification.
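  • The access step can be sketched as follows, assuming Python's standard urllib is used to fetch each URL. The names default_fetch and access_link, the 5-second timeout and the shape of the returned dict are illustrative assumptions, not part of the application. The fetch function is injectable so that the timing logic can be exercised without network access.

```python
import time
import urllib.error
import urllib.request


def default_fetch(url, timeout=5.0):
    """Fetch url and return (status_code, body_text).

    HTTP error statuses such as 404 are returned rather than raised.
    Network-level failures (urllib.error.URLError) still propagate.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, ""


def access_link(url, fetch=default_fetch):
    """Access one sitemap link and record its status and response time."""
    start = time.monotonic()
    status, body = fetch(url)
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return {"url": url, "status": status, "response_ms": elapsed_ms, "body": body}


# Usage with a stand-in fetcher (hypothetical URL, no network needed):
result = access_link("https://example.com/missing", fetch=lambda u: (404, ""))
```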
  • Step S103 Delete a link in the website map that affects the search and inclusion according to the access result.
  • If a link in the website map returns an HTTP 404 (not found) error, takes too long to access, or has other problems, it affects the search engine's inclusion of the website's links.
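  • Deleting the links that affect search inclusion according to the access result includes: deleting the corresponding link when the access result is an HTTP 404 error; or when the page response time is greater than or equal to the set threshold; or when the title, keywords and description of the page are incomplete; or when the body content of the page does not match its title, keywords and description.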
  • Step S104 Generate a new website map.
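  • Generating the new website map can be sketched as re-serializing only the entries that survived the checks. The sketch assumes the sitemaps.org 0.9 XML format and entries held as dicts with a "loc" key plus optional metadata; the function name build_sitemap and this entry shape are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries, removed_locs):
    """Serialize the entries whose loc is not in removed_locs as a <urlset>."""
    # Register the sitemap namespace as the default one for serialization.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for entry in entries:
        if entry["loc"] in removed_locs:
            continue  # link was deleted by an earlier check
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        for field in ("loc", "lastmod", "changefreq", "priority"):
            value = entry.get(field)
            if value is not None:
                ET.SubElement(url, f"{{{SITEMAP_NS}}}{field}").text = value
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical input: two links, one of which was flagged for removal.
new_xml = build_sitemap(
    [
        {"loc": "https://example.com/a"},
        {"loc": "https://example.com/b", "lastmod": "2016-10-01"},
    ],
    removed_locs={"https://example.com/a"},
)
```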
  • the method may further include: providing the generated new website map to the search engine for access.
  • In addition, data on what the search engine includes after accessing the new website map may be recorded.
  • The technical solution of this embodiment first accesses the links of the pages in the website map and, when the access result shows that a link affects search inclusion, deletes that link from the website map and generates a new website map. The original website map of the website is thereby optimized so that it avoids, as far as possible, low-quality or erroneous links. This improves the quality of the website map and increases the likelihood that its links are included by the search engine, meeting the respective needs of the website and the search engine.
  • FIG. 2 is a schematic flow chart of a method of processing a website map in accordance with another embodiment of the present application.
  • Step S201 Obtain a website map of the website according to the preset information.
  • For this step, see the description of step S101 above.
  • Step S202 Obtain a link of a page in the website map and perform an access.
  • For this step, see the description of step S102 above.
  • Step S203 Delete a link in the website map that affects the search and inclusion according to the access result.
  • For this step, see the description of step S103 above.
  • Step S204 extracting keywords and text feature values from the accessed pages.
  • Keywords may be extracted from the page content, and a text feature value extracted from the body content, using any of various existing algorithms; the present application does not limit the choice.
  • Step S205 Delete a link affecting the search and inclusion in the website map according to the comparison result between the extracted keyword and the text feature value and the pre-stored keyword and the text feature value.
  • The extracted keywords and text feature value are compared with the pre-stored keywords and text feature values, respectively, to determine whether the comparison results are consistent.
  • That is, the extracted keywords are compared with the pre-stored keywords, and the extracted text feature value is compared with the pre-stored text feature values.
  • If both comparison results are consistent, the extracted keywords and text feature value are considered consistent with the pre-stored keywords and text feature values; the content is then judged to have been submitted repeatedly, and the corresponding link is deleted.
  • Step S206 generating a new website map.
  • Step S207 may further be included: providing the generated new website map to the search engine for access.
  • The generated new website map may replace the original website map of the website, so that the search engine obtains the new website map when it visits the website; alternatively, the website may be configured so that the search engine accesses the new website map directly on the service platform that generated it. The present application does not limit this, as long as the search engine can access the new website map.
  • the method further includes: recording the collected data that is searched and included by the search engine after accessing the new website map.
  • In this embodiment, links that affect search inclusion can be deleted not only according to the access result but also according to the result of comparing the extracted keywords and text feature value with the pre-stored keywords and text feature values, which further optimizes the website map.
  • In addition, data on what the search engine includes after accessing the new website map may be recorded, providing a reference for subsequent modification of the website map or for analysis of the website.
  • FIG. 3 is a schematic flowchart of a method for processing a website map according to still another embodiment of the present application.
  • In this embodiment, the website map is processed by a sitemap service platform.
  • As shown in FIG. 3, the method includes:
  • Step S301 The sitemap service platform performs data extraction on the sitemap of the website according to the setting information of the website.
  • The website and the sitemap service platform (hereinafter referred to as the service platform) negotiate in advance. The website sets the mapping relationship between its sitemap and the service platform, allowing the service platform to process the sitemap according to setting information provided by the website, for example address information.
  • The mapping relationship set by the website can be implemented in XML.
  • the service platform can extract data from the sitemap according to the setting information provided by the website, and obtain the URL information of each link therein.
  • Step S302 The service platform checks the URLs extracted from the sitemap and determines whether accessing a URL returns an HTTP 404 error. If yes, step S311 is performed to delete the URL from the sitemap and record the reason. If not, step S303 is performed.
  • An HTTP 404 error means that the web page pointed to by the link does not exist, that is, the URL of the original web page is invalid. This happens often, for example when the URL generation rule is changed, the web page file is renamed or moved, or an inbound link is misspelled, so that the original URL can no longer be accessed. When the web server receives such a request, it returns a 404 status code, telling the browser that the requested resource does not exist. Therefore, when accessing the URL produces an HTTP 404 error, the URL has expired; the URL is then deleted from the sitemap and the reason is recorded.
  • Step S303 The service platform determines whether the page response speed of the access URL is abnormal. If yes, step S311 is performed to delete the URL from the sitemap and record the reason. If not, step S304 is performed.
  • the response speed of the page is detected, and the response speed can be measured by the response time. If the response time is greater than or equal to the set threshold, the response speed is considered abnormal, and if it is less than the set threshold, the response speed is considered normal.
  • the threshold value can be set according to experience, for example, set to 500 milliseconds or 1 second, which is not limited by the application of the present invention.
  • the page history access response speed can be compared with the current access response speed to determine whether the response speed is abnormal. If the current response time is much larger than the historical response time and the difference exceeds a certain threshold, the response speed is considered abnormal.
  • If the page response speed is abnormal, the page corresponding to the URL or the network connection to it may have a problem, which may affect the user's browsing experience. In that case the URL is deleted from the sitemap and the reason is recorded.
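  • The two response-speed rules of step S303 can be sketched as one function. The 500 ms default follows the example threshold given above; using the mean of past response times as the historical baseline, and the parameter names, are assumptions made for illustration, since the application does not say how the history is summarized.

```python
def is_response_abnormal(current_ms, threshold_ms=500.0,
                         history_ms=None, max_increase_ms=None):
    """Return True if the page response speed should be treated as abnormal.

    Rule 1: the response time is greater than or equal to a fixed threshold.
    Rule 2 (optional): the response time exceeds the historical baseline by
    more than max_increase_ms. The baseline here is the mean of history_ms,
    which is an assumption of this sketch.
    """
    if current_ms >= threshold_ms:
        return True
    if history_ms and max_increase_ms is not None:
        baseline = sum(history_ms) / len(history_ms)
        return current_ms - baseline > max_increase_ms
    return False
```

  • For example, with a history of 100 ms and 120 ms and an allowed increase of 150 ms, a 300 ms response is flagged even though it is below the fixed 500 ms threshold.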
  • Step S304 The service platform determines whether the TKD of the page is incomplete. If yes, step S311 is performed to delete the URL from the sitemap and record the reason. If not, step S305 is performed.
  • TKD is an abbreviation for title, keywords and description.
  • the format of TKD can be as follows:
  • A keyword is a term that a website administrator sets for a page of the website so that users can find the page through a search engine; the keywords represent the market positioning of the website. The description, also known as the “content label”, “description label” or “content summary”, reflects the main content of the web page.
  • A complete TKD conforms to the search engine's search rules. If the TKD is incomplete, it does not conform to those rules, and the search engine may not retrieve the page or include the linked content. Therefore, when the TKD is found to be incomplete, the URL is deleted from the sitemap and the reason is recorded.
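  • The TKD completeness check of step S304 can be sketched as below. The original listing of the TKD format is not reproduced in this text, so the sketch assumes the conventional HTML form: a title element plus meta tags named "keywords" and "description". The class and function names and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class TKDParser(HTMLParser):
    """Collect the title, meta keywords and meta description of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attr_map.get("name") or "").lower()
            content = (attr_map.get("content") or "").strip()
            if name == "keywords":
                self.keywords = content
            elif name == "description":
                self.description = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_tkd(html):
    parser = TKDParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "keywords": parser.keywords,
        "description": parser.description,
    }


def tkd_is_complete(tkd):
    """A TKD is complete only if all three fields are non-empty."""
    return all(tkd[field] for field in ("title", "keywords", "description"))


# Hypothetical page used only to show the assumed TKD layout.
page = (
    "<html><head><title>Cheap flights</title>"
    '<meta name="keywords" content="flights, tickets">'
    '<meta name="description" content="Compare flight prices.">'
    "</head><body></body></html>"
)
tkd = extract_tkd(page)
```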
  • Step S305 The service platform determines whether the content of the page body does not match the TKD. If yes, step S311 is performed to delete the URL from the sitemap and record the reason. If not, step S306 is performed.
  • This check determines whether the keywords in the TKD appear in the body text, and whether the content of the body text corresponds to the title and description in the TKD. If the keywords appear in the body text and the body content corresponds to the title and description, the page matches; otherwise it does not match. A mismatch may mean that the body is set incorrectly or that the TKD is set incorrectly. Either affects the search engine's search quality and the user's browsing experience. Therefore, if the body content of the page is found not to match the TKD, the URL is deleted from the sitemap and the reason is recorded.
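  • A rough sketch of the matching check in step S305 follows. The application does not define how "corresponds to" is measured, so the word-overlap ratio, the min_overlap parameter and the comma-separated keyword format are all assumptions of this sketch. Splitting on word characters also does not segment languages written without spaces, such as Chinese.

```python
import re


def _words(text):
    """Lower-cased set of word tokens in text."""
    return set(re.findall(r"\w+", text.lower()))


def body_matches_tkd(body, title, keywords, description, min_overlap=0.5):
    """Return True if the body text plausibly matches the page's TKD.

    Every declared keyword must occur in the body, and at least min_overlap
    of the words in the title and in the description must occur in the body.
    """
    body_lower = body.lower()
    declared = [k.strip().lower() for k in keywords.split(",") if k.strip()]
    if not declared or not all(k in body_lower for k in declared):
        return False
    body_words = _words(body)
    for text in (title, description):
        words = _words(text)
        if not words:
            return False
        if len(words & body_words) / len(words) < min_overlap:
            return False
    return True
```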
  • Step S306 The service platform performs keyword extraction on the content of the page, and extracts a text feature value for the body content.
  • The service platform may extract keywords from the page content and extract a text feature value from the body content using any of various existing algorithms; the present application does not limit the choice.
  • For example, keyword extraction can use the existing TF-IDF (term frequency-inverse document frequency) algorithm: a dictionary stores the weight information of every word, the dictionary is sorted by value, and the top-ranked words by weight are taken as the keywords.
  • the text feature extraction method based on the context framework or the text feature extraction method based on the ontology may be adopted for extracting the text feature value from the body content.
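  • The TF-IDF keyword extraction mentioned for step S306 can be sketched as follows. The tokenizer, the smoothed IDF formula and the function name tfidf_keywords are choices made for this sketch; the application names TF-IDF only as one usable existing algorithm and gives no formula.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lower-case ASCII letters and digits only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_keywords(document, corpus, top_k=3):
    """Return the top_k words of document ranked by TF-IDF weight.

    corpus is a list of document strings used to compute document frequency.
    """
    doc_tokens = tokenize(document)
    if not doc_tokens:
        return []
    term_counts = Counter(doc_tokens)
    corpus_token_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus_token_sets)

    # Dictionary of word -> weight, as in the description above.
    weights = {}
    for word, count in term_counts.items():
        df = sum(1 for tokens in corpus_token_sets if word in tokens)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
        weights[word] = (count / len(doc_tokens)) * idf

    # Sort by weight (descending), breaking ties alphabetically.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


corpus = [
    "sitemap sitemap crawler links",
    "crawler links page",
    "links page search",
]
keywords = tfidf_keywords(corpus[0], corpus, top_k=2)
```

  • In this small corpus "links" occurs in every document, so its weight is lowest, while "sitemap" is both frequent in the first document and rare elsewhere.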
  • Step S307 The service platform compares the extracted keywords and text feature value with the keywords and text feature values stored on the service platform, to check whether content has been submitted repeatedly. If yes, step S311 is performed to delete the URL from the sitemap and record the reason. If not, step S308 is performed.
  • The extracted keywords and text feature value are compared with those stored on the service platform to match body relevance. If the same keywords and text feature value are found on the service platform, the content is judged to be a duplicate. This matching makes it possible to detect repeated submission of content. The keywords and text feature values of each previously checked page are stored on the service platform in advance.
  • Step S308 The service platform saves the extracted keywords, text feature value and corresponding link for use in subsequent checks.
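  • Steps S307 and S308 can be sketched together as a small in-memory store. The application does not specify how the text feature value is computed; the SHA-256 hash of the whitespace-normalized body used here is only a stand-in that detects exact duplicates, and the class name ContentStore is hypothetical.

```python
import hashlib


def text_feature_value(body):
    """Stand-in feature value: hash of the lower-cased, whitespace-normalized body."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ContentStore:
    """Remembers (keywords, feature value) pairs seen so far, with their URL."""

    def __init__(self):
        self._seen = {}

    def check_and_store(self, url, keywords, body):
        """Return the URL that already submitted the same content, else None.

        New content is saved for use in subsequent checks (step S308).
        """
        key = (frozenset(keywords), text_feature_value(body))
        original = self._seen.get(key)
        if original is not None and original != url:
            return original  # repeated submission: caller deletes this link
        self._seen[key] = url
        return None


store = ContentStore()
first = store.check_and_store("https://example.com/a", ["sitemap", "crawler"], "Hello  World")
second = store.check_and_store("https://example.com/b", ["crawler", "sitemap"], "hello world")
third = store.check_and_store("https://example.com/c", ["other"], "hello world")
```

  • A near-duplicate detector (for example a similarity hash) could replace the exact hash without changing the store's interface.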
  • Step S309 The service platform generates the processed new sitemap data for the search engine to acquire.
  • The website may be configured to instruct the search engine to obtain the sitemap directly from the service platform, or the service platform may directly replace the website's original sitemap with the new sitemap.
  • Step S310 The service platform monitors the latest sitemap data.
  • After the search engine includes a URL, tag information is returned; the service platform monitors which URLs have been included by the search engine, which can provide a reference for subsequent adjustment of the sitemap.
  • Step S311 The service platform deletes the link from the sitemap, and records the reason for the website to analyze.
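  • The order of checks in steps S302 to S307, with step S311 recording the reason for each deleted link, can be sketched as follows. The AccessResult fields, the reason strings and the function names are assumptions of this sketch; each boolean is presumed to have been computed by an earlier check.

```python
from dataclasses import dataclass


@dataclass
class AccessResult:
    url: str
    status: int              # HTTP status code returned for the URL
    response_ms: float       # measured page response time
    tkd_complete: bool       # title, keywords and description all present
    body_matches_tkd: bool   # body content matches the TKD
    is_duplicate: bool       # same keywords and feature value already stored


def first_failure(result, threshold_ms=500.0):
    """Return the reason the link should be deleted, or None to keep it."""
    checks = [
        (result.status == 404, "HTTP 404"),                        # S302
        (result.response_ms >= threshold_ms, "slow response"),     # S303
        (not result.tkd_complete, "incomplete TKD"),               # S304
        (not result.body_matches_tkd, "body does not match TKD"),  # S305
        (result.is_duplicate, "duplicate content"),                # S307
    ]
    for failed, reason in checks:
        if failed:
            return reason
    return None


def filter_links(results, threshold_ms=500.0):
    """Split results into kept URLs and a {url: reason} record of deletions."""
    kept, removed = [], {}
    for result in results:
        reason = first_failure(result, threshold_ms)
        if reason is None:
            kept.append(result.url)
        else:
            removed[result.url] = reason  # S311: delete and record the reason
    return kept, removed


kept, removed = filter_links([
    AccessResult("https://example.com/ok", 200, 120.0, True, True, False),
    AccessResult("https://example.com/gone", 404, 80.0, True, True, False),
    AccessResult("https://example.com/slow", 200, 900.0, False, True, False),
])
```

  • Only the first failing check is recorded for each link, mirroring the flow in which a failed check jumps straight to step S311.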
  • The technical solution of this embodiment analyzes and filters the sitemap data obtained from the website, verifies the links provided by the sitemap by accessing them, extracts keywords from the page content and a text feature value from the body content, and matches them against pre-stored keywords and text feature values, so as to avoid submitting duplicate or poor-quality content.
  • In addition, how the search engine includes the sitemap's links can be monitored.
  • The present application can therefore optimize the quality of the sitemap, improve the quality of the website content collected by the search engine, and enable the search engine to better include the website's pages. It also alleviates the search ranking demotion caused by submitting duplicate and spam content to the search engine, and allows the inclusion of the website's content to be monitored more effectively.
  • the embodiment of the present invention further provides an apparatus for processing a website map.
  • FIG. 4 is a schematic block diagram of an apparatus for processing a website map according to an embodiment of the present application.
  • an apparatus for processing a website map includes: an obtaining module 401, an access module 402, a first processing module 403, and a generating module 404.
  • the device for processing a website map according to an embodiment of the present application may be a service platform or the like.
  • the obtaining module 401 is configured to obtain a website map of the website according to the preset information.
  • the obtaining module 401 obtains the website map of the website according to the setting information provided by the website.
  • the access module 402 is configured to obtain a link of the page in the website map and access according to the website map acquired by the obtaining module 401.
  • the access module 402 obtains each URL link in the website map and separately accesses the URL link for verification.
  • the first processing module 403 is configured to delete a link in the website map that affects the search and inclusion according to the access result of the access module 402.
  • the first processing module 403 deletes the links in the website map that affect the search listing according to various different access results.
  • the generating module 404 is configured to generate a new website map after the processing by the first processing module 403.
  • FIG. 5 is another schematic block diagram of an apparatus for processing a website map according to an embodiment of the present application.
  • An apparatus for processing a website map includes: an obtaining module 401, an access module 402, a first processing module 403, and a generating module 404. The functions of each module are as described for FIG. 4.
  • the device further includes: a second processing module 405.
  • The second processing module 405 is configured to extract keywords and a text feature value from the accessed page, and to delete links in the website map that affect search inclusion according to the result of comparing the extracted keywords and text feature value with the pre-stored keywords and text feature values.
  • The generating module 404 is configured to generate a new website map after the first processing module 403 and the second processing module 405 perform processing.
  • When the comparison shows that the extracted keywords and text feature value are consistent with the pre-stored keywords and text feature values, the second processing module 405 determines that the content has been submitted repeatedly and deletes the corresponding link.
  • the device also includes an output module 406.
  • the output module 406 is configured to provide a new website map generated by the generating module to the search engine for access.
  • In this embodiment, the generated new website map may replace the original website map of the website, so that the search engine obtains the new website map from the website; alternatively, the website may be configured so that the search engine accesses the new website map directly on the service platform. The embodiment is not limited in this respect, as long as the search engine can access the new website map.
  • the device also includes a monitoring module 407.
  • the monitoring module 407 is configured to record the collected data that is searched and included by the search engine after accessing the new website map.
  • the first processing module 403 includes: a first deleting unit 4031, a second deleting unit 4032, a third deleting unit 4033, or a fourth deleting unit 4034.
  • the first deleting unit 4031 is configured to delete the corresponding link when the access result is an HTTP 404 error that cannot be accessed.
  • the second deleting unit 4032 is configured to delete the corresponding link when the access result is that the page response time is greater than or equal to the set threshold.
  • the third deleting unit 4033 is configured to delete the corresponding link when the access result is that the title, the keyword, and the description of the page are incomplete.
  • the fourth deleting unit 4034 is configured to delete the corresponding link when the access result is that the body content of the page does not match the title, keyword, and description of the page.
  • the embodiment of the present application further provides a processing device.
  • FIG. 6 is a schematic block diagram of a processing device 600 in accordance with one embodiment of the present application.
  • the processing device 600 provided by the embodiment of the present application includes: a memory 601 and a processor 602.
  • the memory 601 is used to store programs.
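  • The processor 602 is configured to execute, by calling the program stored in the memory 601, a method including the following steps: obtaining a website map of a website according to preset information; obtaining the links of the pages in the website map and accessing them; deleting links in the website map that affect search inclusion according to the access result; and generating a new website map.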
  • the processor 602 executes various functions and data processing by executing the above-mentioned program stored in the memory 601, that is, the method for processing the website map in the embodiment of the present application.
  • The memory 601 can include, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like.
  • After receiving an execution instruction, the processor 602 can execute the foregoing program stored in the memory 601 to implement the method defined by the flow disclosed in any embodiment of the present application.
  • Processor 602 can be an integrated circuit chip with signal processing capabilities.
  • The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • It can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application.
  • The general-purpose processor may be a microprocessor or any conventional processor.
  • FIG. 6 is merely illustrative; the processing device 600 may include more or fewer components than those shown in FIG. 6, or have a configuration different from that shown in FIG. 6.
  • the components shown in Figure 6 can be implemented in hardware, software, or a combination thereof.
  • the modules and units of the apparatus for processing a website map in the foregoing embodiments may be implemented by software code.
  • the modules and units described above may be stored in the memory 601 of the processing device 600.
  • the above modules and units can also be implemented by hardware such as an integrated circuit chip.
  • The technical solution of the embodiments of the present invention analyzes and filters the sitemap data obtained from the website, verifies the links provided by the sitemap by accessing them, extracts keywords from the page content and a text feature value from the body content, and matches them against pre-stored keywords and text feature values, so as to avoid submitting duplicate or poor-quality content.
  • In addition, how the search engine includes the sitemap's links can be monitored.
  • The above method according to an embodiment of the present application may also be implemented as a computer program comprising computer program code that, when executed by a processor, causes the processor to perform the method provided by the above embodiments of the present application.
  • The method according to the present application may also be embodied as a computer program product comprising a computer readable medium having processor-executable non-volatile program code, the program code causing the processor to perform the above method of the present application.
  • the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
  • each block of the flowchart or block diagram can represent a module, a program segment, or a portion of code that includes one or more executable instructions.
  • the functions noted in the blocks may also occur in a different order than the ones in the drawings. For example, two consecutive blocks may be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified function or operation, or by a combination of dedicated hardware and computer instructions.

Abstract

A method, an apparatus and a device for processing a sitemap. The method comprises: acquiring the sitemap of a website according to preset information (101); acquiring the links of the pages in the sitemap and accessing them (102); deleting from the sitemap, according to the access results, links that affect search indexing (103); and generating a new sitemap (104). The method improves the quality of the sitemap and increases the likelihood of its links being indexed by a search engine, thereby meeting the respective needs of the website and of the search engine.

Description

Method, apparatus and device for processing a sitemap
This application claims priority to Chinese Patent Application No. CN201510676894.0, filed with the Chinese Patent Office on October 16, 2015 and entitled "Method, apparatus and device for processing a sitemap", the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of mobile Internet technologies, and in particular to a method, apparatus and device for processing a sitemap.
Background
At present, search engines usually discover web pages through links within a website (also called a site) and on other websites. A sitemap makes it easy for a website to tell search engines which pages on the site are available for crawling. In its simplest form, a sitemap is an XML (Extensible Markup Language) file that lists the URLs of the site together with additional metadata about each URL (when it was last updated, how often it changes, how important it is relative to the other URLs on the site, and so on), so that search engines can crawl the site more intelligently. Put simply, a sitemap can be understood as a list of the links on a website. Generating a sitemap and submitting it to a search engine makes the site's content, including deeply buried pages, easier to index, and is a good way for a website to communicate with a search engine.
However, the links contained in the sitemaps that many websites currently provide are of low quality and can exhibit a number of problems, such as broken links, or linked content that is poor or out of date. All of these waste the search engine's crawling resources. As a result, even though a website provides a sitemap, the search engine, depending on what its crawl finds, will not necessarily index the links in the sitemap. These problems may also trigger the search engine's demotion rules, reducing the number of the site's links that are indexed and lowering the site's search ranking.
Summary of the invention
To solve the above technical problem, the present application provides a method, apparatus and device for processing a sitemap, so as to improve the quality of the website links in the sitemap.
According to one aspect of the present application, a method for processing a sitemap is provided, comprising: acquiring the sitemap of a website according to preset information; acquiring the links of the pages in the sitemap and accessing them; deleting, according to the access results, links in the sitemap that affect search indexing; and generating a new sitemap.
Preferably, after acquiring the links of the pages in the sitemap and accessing them, the method further comprises: extracting keywords and a body-text feature value from each accessed page; and deleting links in the sitemap that affect search indexing according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values.
Preferably, deleting, according to the access results, links in the sitemap that affect search indexing comprises: deleting the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed; or deleting the corresponding link when the access result is that the page response time is greater than or equal to a set threshold; or deleting the corresponding link when the access result is that the title, keywords and description of the page are incomplete; or deleting the corresponding link when the access result is that the body content of the page does not match the title, keywords and description of the page.
Preferably, deleting links in the sitemap that affect search indexing according to the result of comparing the extracted keywords and body-text feature value with the pre-stored keywords and body-text feature values comprises: when the comparison finds them identical, determining that the content has been submitted repeatedly, and deleting the corresponding link.
Preferably, the method further comprises: providing the generated new sitemap to a search engine for access.
Preferably, the method further comprises: recording indexing data on what the search engine crawls and indexes after accessing the new sitemap.
According to another aspect of the present application, an apparatus for processing a sitemap is provided, comprising: an acquisition module, configured to acquire the sitemap of a website according to preset information; an access module, configured to acquire the links of the pages in the sitemap acquired by the acquisition module and to access them; a first processing module, configured to delete, according to the access results of the access module, links in the sitemap that affect search indexing; and a generation module, configured to generate a new sitemap after the first processing module has performed its processing.
Preferably, the apparatus further comprises: a second processing module, configured to extract keywords and a body-text feature value from each accessed page, and to delete links in the sitemap that affect search indexing according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values; the generation module is configured to generate the new sitemap after the first processing module and the second processing module have performed their processing.
Preferably, the apparatus further comprises: an output module, configured to provide the new sitemap generated by the generation module to a search engine for access.
Preferably, the apparatus further comprises: a monitoring module, configured to record indexing data on what the search engine crawls and indexes after accessing the new sitemap.
Preferably, the first processing module comprises: a first deletion unit, configured to delete the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed; or a second deletion unit, configured to delete the corresponding link when the access result is that the page response time is greater than or equal to a set threshold; or a third deletion unit, configured to delete the corresponding link when the access result is that the title, keywords and description of the page are incomplete; or a fourth deletion unit, configured to delete the corresponding link when the access result is that the body content of the page does not match the title, keywords and description of the page.
According to another aspect of the present application, a processing device is provided, comprising: a memory, configured to store a program; and a processor, configured to execute, by calling the program stored in the memory, a method comprising the following steps: acquiring the sitemap of a website according to preset information; acquiring the links of the pages in the sitemap and accessing them; deleting, according to the access results, links in the sitemap that affect search indexing; and generating a new sitemap.
According to another aspect of the present application, a computer-readable medium having non-volatile program code executable by a processor is provided, the program code causing the processor to perform the above method.
It can be seen that, in the technical solution of the embodiments of the present application, the links of the pages in the sitemap are first acquired and accessed; links found from the access results to affect search indexing are deleted from the sitemap; and a new sitemap is then generated. The website's original sitemap is thereby optimized, and links whose content is poor or that are prone to errors are kept out of the sitemap as far as possible. This improves the quality of the sitemap and increases the likelihood of its links being indexed by search engines, meeting the needs of both the website and the search engine.
Brief description of the drawings
The above and other objects, features and advantages of the present application will become more apparent from the following more detailed description of exemplary embodiments of the present application taken in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.
FIG. 1 is a schematic flowchart of a method for processing a sitemap according to one embodiment of the present application;
FIG. 2 is a schematic flowchart of a method for processing a sitemap according to another embodiment of the present application;
FIG. 3 is a schematic flowchart of a method for processing a sitemap according to yet another embodiment of the present application;
FIG. 4 is a schematic block diagram of an apparatus for processing a sitemap according to one embodiment of the present application;
FIG. 5 is a schematic block diagram of an apparatus for processing a sitemap according to another embodiment of the present application;
FIG. 6 is a schematic block diagram of a processing device according to yet another embodiment of the present application.
Detailed description
Preferred embodiments of the present application are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the present application, it should be understood that the present application may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present application will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In addition, in the description of the present application, the terms "first", "second" and the like are used only to distinguish between descriptions and shall not be understood as indicating or implying relative importance.
The present application provides a method for processing a sitemap that can improve the quality of the website links in the sitemap and meet the respective needs of the website and the search engine.
FIG. 1 is a schematic flowchart of a method for processing a sitemap according to one embodiment of the present application.
As shown in FIG. 1, the method comprises:
Step S101: acquire the sitemap of a website according to preset information.
In this step, the sitemap of the website may be obtained according to setting information that the website provides after an agreement has been reached with the website.
Step S102: acquire the links of the pages in the sitemap and access them.
In this step, the link of each page in the sitemap is acquired. Each such link is a URL (Uniform Resource Locator), and each URL is accessed in turn for verification.
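By way of non-limiting illustration, the link acquisition of step S102 might be sketched in Python as follows. The sketch assumes the sitemap is an XML document in the sitemaps.org `urlset` format; the function name `extract_links` and the sample document are hypothetical and are not part of the application.

```python
import xml.etree.ElementTree as ET

# Namespace used by <urlset> documents in the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_links(sitemap_xml):
    """Return the URL in every <url><loc> entry of a sitemap document.

    Hypothetical helper: the application does not prescribe a parser.
    """
    root = ET.fromstring(sitemap_xml)
    links = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        text = (loc.text or "").strip()
        if text:  # skip empty <loc> elements
            links.append(text)
    return links


# Illustrative input only; www.example.com is a placeholder site.
EXAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>http://www.example.com/</loc><lastmod>2015-10-16</lastmod></url>"
    "<url><loc>http://www.example.com/about</loc></url>"
    "</urlset>"
)

if __name__ == "__main__":
    for link in extract_links(EXAMPLE_SITEMAP):
        print(link)
```

Each returned URL would then be accessed in turn for verification.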
Step S103: delete, according to the access results, links in the sitemap that affect search indexing.
If a link in the sitemap returns an HTTP 404 error indicating that the page cannot be accessed, takes too long to access, or has other problems, search engines' indexing of the site's links is affected.
In this step, deleting, according to the access results, links in the sitemap that affect search indexing comprises:
deleting the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed; or,
deleting the corresponding link when the access result is that the page response time is greater than or equal to a set threshold; or,
deleting the corresponding link when the access result is that the title, keywords and description of the page are incomplete; or,
deleting the corresponding link when the access result is that the body content of the page does not match the title, keywords and description of the page.
Of course, the cases and ways in which links that affect search indexing are deleted from the sitemap according to the access results are not limited to those of this embodiment; others may also be included.
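The four deletion conditions listed above might, purely as an illustrative sketch, be combined into a single decision function. The `AccessResult` record, its field names, the reason strings and the default threshold of 1000 ms are assumptions made for this example; the application does not define a data structure for access results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessResult:
    """Hypothetical record of one page access; field names are illustrative."""
    status: int               # HTTP status code returned for the link
    response_ms: float        # measured page response time in milliseconds
    metadata_complete: bool   # title, keywords and description all present
    body_matches_metadata: bool  # body matches title, keywords and description


def deletion_reason(result: AccessResult, threshold_ms: float = 1000.0) -> Optional[str]:
    """Return the reason a link should be deleted, or None to keep it."""
    if result.status == 404:
        return "HTTP 404"
    if result.response_ms >= threshold_ms:  # "greater than or equal to" the threshold
        return "slow response"
    if not result.metadata_complete:
        return "incomplete title/keywords/description"
    if not result.body_matches_metadata:
        return "body does not match title/keywords/description"
    return None
```

Returning a reason string rather than a bare boolean is a design choice of the sketch; it makes it straightforward to log why each link was dropped.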
Step S104: generate a new sitemap.
In this step, after the links that affect search indexing have been deleted from the sitemap, the remaining entries are reorganized to generate a new sitemap.
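As a minimal sketch of step S104, again assuming an XML sitemap in the sitemaps.org format, the new sitemap could be produced by removing the `<url>` entries of the deleted links and serializing what remains. The name `build_sitemap` and the constant `SITEMAP_URI` are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_URI = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(sitemap_xml, deleted):
    """Return a new sitemap document without the <url> entries in `deleted`.

    `deleted` is a set of URL strings. Entries that are kept retain their
    other metadata (for example <lastmod>).
    """
    # Serialize the sitemap namespace as the default namespace (no prefix).
    ET.register_namespace("", SITEMAP_URI)
    root = ET.fromstring(sitemap_xml)
    loc_tag = "{%s}loc" % SITEMAP_URI
    url_tag = "{%s}url" % SITEMAP_URI
    for url in list(root.findall(url_tag)):  # copy: we mutate root below
        loc = url.find(loc_tag)
        if loc is not None and (loc.text or "").strip() in deleted:
            root.remove(url)
    return ET.tostring(root, encoding="unicode")
```

Filtering the original document, rather than rebuilding it from a bare list of URLs, keeps each surviving entry's metadata intact.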
Further, in this embodiment, after step S104 the method may also comprise: providing the generated new sitemap to a search engine for access.
Moreover, after the generated new sitemap has been provided to the search engine for access, the method may also comprise recording indexing data on what the search engine crawls and indexes after accessing the new sitemap.
It can be seen that, in the technical solution of the embodiments of the present application, the links of the pages in the sitemap are first acquired and accessed; links found from the access results to affect search indexing are deleted from the sitemap; and a new sitemap is then generated. The website's original sitemap is thereby optimized, and links whose content is poor or that are prone to errors are kept out of the sitemap as far as possible. This improves the quality of the sitemap and increases the likelihood of its links being indexed by search engines, meeting the needs of both the website and the search engine.
The technical solution of the present application is described in more detail below.
FIG. 2 is a schematic flowchart of a method for processing a sitemap according to another embodiment of the present application.
As shown in FIG. 2, the method comprises:
Step S201: acquire the sitemap of a website according to preset information.
For this step, see the description of step S101 above.
Step S202: acquire the links of the pages in the sitemap and access them.
For this step, see the description of step S102 above.
Step S203: delete, according to the access results, links in the sitemap that affect search indexing.
For this step, see the description of step S103 above.
Step S204: extract keywords and a body-text feature value from each accessed page.
In this step, any of various existing algorithms may be used to extract keywords from the page content and to extract a body-text feature value from the body content; the present application places no limitation on this.
Step S205: delete links in the sitemap that affect search indexing according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values.
In this step, the extracted keywords and body-text feature value are compared with the pre-stored keywords and body-text feature values respectively, to determine whether they are identical.
Specifically, the extracted keywords are compared with the pre-stored keywords, and the extracted body-text feature value is compared with the pre-stored body-text feature values. When both comparisons find a match, the extracted keywords and body-text feature value are deemed identical to the pre-stored ones, the content is determined to have been submitted repeatedly, and the corresponding link is deleted.
Step S206: generate a new sitemap.
Further, in this embodiment, after the new sitemap is generated in step S206, the method may also comprise step S207: providing the generated new sitemap to a search engine for access.
In this step, the generated new sitemap may replace the website's original sitemap, so that the search engine accesses the new sitemap at the website; alternatively, the website may be configured so that the search engine accesses the new sitemap directly at the service platform that generated it. The present application places no limitation on this, provided that the search engine can access the new sitemap.
It should be noted that the processing of steps S202 and S203 and the processing of steps S204 and S205 need not be performed in any particular order relative to each other; the above ordering of steps is only for convenience of description.
It should also be noted that, after step S207, the method may further comprise: recording indexing data on what the search engine crawls and indexes after accessing the new sitemap.
It can be seen that the technical solution of the embodiments of the present application can delete links in the sitemap that affect search indexing both according to the access results and according to the result of comparing the extracted keywords and body-text feature value with the pre-stored keywords and body-text feature values, which improves the optimization effect. In addition, the indexing data on what the search engine crawls and indexes after accessing the new sitemap can be recorded, providing a reference for subsequent sitemap modifications or data for the website to analyze.
FIG. 3 is a schematic flowchart of a method for processing a sitemap according to yet another embodiment of the present application. In this embodiment, the sitemap is processed by a sitemap service platform.
Specifically, as shown in FIG. 3, the method comprises:
Step S301: the sitemap service platform extracts data from the website's sitemap according to the website's setting information.
In this step, the website and the sitemap service platform (hereinafter, the service platform) reach an agreement in advance. The website sets a mapping between its sitemap and the service platform, permitting the service platform to process the sitemap according to setting information provided by the website, such as address information. The website may set this mapping by means of XML. Based on the setting information provided by the website, the service platform can extract data from the sitemap and obtain the URL of each link in it.
Step S302: the service platform checks each URL extracted from the sitemap and determines whether accessing the URL produces an HTTP 404 error indicating that the page cannot be accessed. If so, step S311 is performed: the URL is deleted from the sitemap and the reason is recorded. If not, step S303 is performed.
An HTTP 404 error means that the web page the link points to does not exist, that is, the URL of the original page is no longer valid. This happens frequently, for example when the rules for generating page URLs change, a page file is renamed or moved, or an inbound link is misspelled, so that the original URL can no longer be accessed. When the web server receives such a request, it returns a 404 status code, telling the browser that the requested resource does not exist. Therefore, when accessing a URL produces an HTTP 404 error, the URL is no longer valid; the URL is then deleted from the sitemap and the reason is recorded.
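The check of step S302 might be sketched as follows, under stated assumptions. `fetch_status` uses the standard-library `urllib` to obtain a status code; `filter_not_found` and the reason string "HTTP 404" are hypothetical names chosen for the example. The fetch function is passed in as a parameter so that the filtering logic can be exercised without network access.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code obtained when accessing `url`.

    Network-level failures (urllib.error.URLError) are deliberately not
    handled here; how to treat them is outside this sketch.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 when the page does not exist


def filter_not_found(urls, fetch=fetch_status):
    """Split `urls` into kept links and (link, reason) pairs for deleted ones."""
    kept, removed = [], []
    for url in urls:
        if fetch(url) == 404:
            removed.append((url, "HTTP 404"))  # record the reason for deletion
        else:
            kept.append(url)
    return kept, removed
```

In a test or dry run, a stub such as `lambda url: 404 if url.endswith("/gone") else 200` can be supplied as `fetch`.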
Step S303: the service platform determines whether the page response speed when accessing the URL is abnormal. If so, step S311 is performed: the URL is deleted from the sitemap and the reason is recorded. If not, step S304 is performed.
When the URL can be accessed normally, the response speed of the page is measured; the response speed can be gauged by the response time. If the response time is greater than or equal to a set threshold, the response speed is considered abnormal; if it is less than the set threshold, the response speed is considered normal. The threshold can be chosen empirically, for example 500 milliseconds or 1 second; the present application places no limitation on this.
It should be noted that the current response speed may instead be compared with the page's historical response speed to determine whether it is abnormal. If the current response time is much longer than the historical response time, with the difference exceeding a certain threshold, the response speed can be considered abnormal.
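A minimal sketch of the two response-speed criteria described above follows. The application presents the fixed threshold and the historical comparison as alternatives; combining them in one function, using the mean of the historical samples as the baseline, and the default values of 1000 ms and 500 ms are all assumptions of this example, as are the names `measure_ms` and `is_slow`.

```python
import time
from statistics import mean


def measure_ms(fetch, url):
    """Return how long `fetch(url)` takes, in milliseconds."""
    start = time.monotonic()
    fetch(url)
    return (time.monotonic() - start) * 1000.0


def is_slow(current_ms, threshold_ms=1000.0, history_ms=(), max_increase_ms=500.0):
    """Decide whether a response time is abnormal.

    Abnormal if the time reaches the fixed threshold, or, when historical
    samples are given, if it exceeds their mean by more than max_increase_ms.
    """
    if current_ms >= threshold_ms:
        return True
    if history_ms:
        return current_ms - mean(history_ms) > max_increase_ms
    return False
```

For example, `is_slow(900.0, history_ms=[100.0, 200.0])` is abnormal under the historical criterion even though 900 ms is below the fixed threshold.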
Therefore, an abnormal page response speed indicates that there may be a problem with the page corresponding to the URL or with the network connection to it, either of which degrades the user's browsing experience; the URL is then deleted from the sitemap and the reason is recorded.
Step S304: the service platform determines whether the TKD of the page is incomplete. If so, step S311 is performed: the URL is deleted from the sitemap and the reason is recorded. If not, step S305 is performed.
TKD stands for title, keywords and description. The TKD of a page may take the following form:
<title>Title content goes here</title>
<meta name="keywords" content="Keyword content goes here"/>
<meta name="description" content="Description content goes here"/>
Keywords are the terms that a website administrator sets for a page of the site so that users can find the page through a search engine; the keywords represent the market positioning of the site. The description, also called the "content tag", "description tag" or "content summary", reflects the main content of the page.
In general, only a complete TKD satisfies a search engine's search rules. If the TKD is incomplete and does not satisfy those rules, the search engine may not crawl the page, or may not index the linked content. Therefore, when the TKD is found to be incomplete, the URL is deleted from the sitemap and the reason is recorded.
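As an illustrative sketch only, the TKD of a page could be read with the standard-library `html.parser` module and checked for completeness. The class `TKDParser` and the functions `extract_tkd` and `tkd_is_complete` are hypothetical names; treating "complete" as "all three fields non-empty" is an assumption, since the application does not define completeness more precisely.

```python
from html.parser import HTMLParser


class TKDParser(HTMLParser):
    """Collects the <title> text and the keywords/description <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            content = (attr.get("content") or "").strip()
            if name == "keywords":
                self.keywords = content
            elif name == "description":
                self.description = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data  # title text may arrive in several chunks


def extract_tkd(html):
    """Return the page's title, keywords and description as a dict."""
    parser = TKDParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "keywords": parser.keywords,
        "description": parser.description,
    }


def tkd_is_complete(tkd):
    """Assumed rule: the TKD is complete when all three fields are non-empty."""
    return all(tkd[field] for field in ("title", "keywords", "description"))
```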
Step S305: the service platform determines whether the body content of the page fails to match the TKD. If so, step S311 is performed: the URL is deleted from the sitemap and the reason is recorded. If not, step S306 is performed.
In this step, the body content of the page is examined to determine whether the keywords in the TKD appear in the body and whether the content of the body corresponds to the title and description in the TKD. If the TKD keywords appear in the body and the body content corresponds to the TKD title and description, the body matches the TKD; otherwise it does not. A mismatch suggests that either the body or the TKD has been set incorrectly, either of which lowers the search engine's search quality and degrades the user's browsing experience. Therefore, when the body content of the page is found not to match the TKD, the URL is deleted from the sitemap and the reason is recorded.
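The application does not specify how "corresponds" is to be measured. One possible reading, offered only as a sketch, is to require every TKD keyword to occur in the body and a minimum share of the title and description words to occur there as well. The word-overlap test, the `min_overlap` value of 0.5, the comma-separated keyword format and the function names are all assumptions; the `\w+` tokenization also assumes whitespace-delimited text and would need a word segmenter for languages such as Chinese.

```python
import re


def _words(text):
    """Lower-cased set of word tokens in `text`."""
    return set(re.findall(r"\w+", text.lower()))


def body_matches_tkd(body, title, keywords, description, min_overlap=0.5):
    """Heuristic match between a page body and its TKD (assumed criterion)."""
    body_lower = body.lower()
    # Keywords are assumed to be separated by ASCII or full-width commas.
    keyword_list = [k.strip().lower() for k in re.split(r"[,\uff0c]", keywords) if k.strip()]
    if not keyword_list or not all(k in body_lower for k in keyword_list):
        return False
    body_words = _words(body)
    for text in (title, description):
        words = _words(text)
        if not words:
            return False
        # Share of the title/description words that also occur in the body.
        if len(words & body_words) / len(words) < min_overlap:
            return False
    return True
```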
Step S306: the service platform extracts keywords from the page content and extracts a body-text feature value from the body content.
In this step, the service platform may use any of various existing algorithms to extract keywords from the page content and to extract a body-text feature value from the body content; the present application places no limitation on this.
For example, keyword extraction may use the existing TF-IDF (term frequency-inverse document frequency) algorithm. An implementation of this algorithm typically keeps the information for all words in a dictionary, sorts the dictionary by value, and finally takes the few words with the highest weights as the keywords. As a further example, the body-text feature value may be extracted using a text feature extraction method based on a context framework or one based on an ontology.
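The dictionary-and-sort procedure described above might be sketched as follows. This is one common smoothed variant of TF-IDF, not necessarily the one contemplated by the application; the tokenizer (ASCII letters and digits only), the reference `corpus` argument, the smoothing formula and the function names are assumptions of the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: runs of ASCII letters and digits, lower-cased."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_keywords(document, corpus, top_n=3):
    """Return the `top_n` words of `document` with the highest TF-IDF weight.

    `corpus` is a list of reference documents used for document frequencies.
    """
    tokens = tokenize(document)
    if not tokens:
        return []
    counts = Counter(tokens)
    corpus_tokens = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(corpus_tokens)
    weights = {}  # the dictionary holding a weight for every word
    for word, count in counts.items():
        tf = count / len(tokens)
        df = sum(1 for doc in corpus_tokens if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
        weights[word] = tf * idf
    # Sort by weight, highest first; ties are broken alphabetically.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]
```

A word that is frequent in the page but also present in every reference document (such as "the") receives a low weight and drops out of the top-ranked keywords.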
Step S307: The service platform compares the extracted keywords and text feature value with the keywords and text feature values stored on the service platform to check whether the content has been submitted repeatedly. If so, step S311 is performed: the URL is deleted from the sitemap and the reason is recorded. If not, step S308 is performed.
This step matches body relevance by comparing the extracted keywords and text feature value with those stored on the service platform. If the same keywords and text feature value are found on the service platform, the content is judged to be a duplicate. This matching check makes it possible to detect repeated submissions. The service platform stores in advance the keywords and text feature values of every page it has already checked.
Step S308: The service platform saves the extracted keywords, the text feature value, and the corresponding link for use in subsequent duplicate checks.
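Steps S307 and S308 can be sketched together as a small in-memory index. This is only an illustration: the description leaves the text feature value unspecified, so a SHA-256 hash of the whitespace-normalized body is used here as a stand-in, and the class name `DuplicateIndex` and its methods are hypothetical.

```python
import hashlib

class DuplicateIndex:
    """In-memory store of (keywords, text feature value) pairs already seen."""

    def __init__(self):
        self._seen = {}  # (keywords, feature value) -> first URL stored

    @staticmethod
    def feature_value(body):
        # Stand-in feature value (an assumption): hash of the normalized body.
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def check_and_store(self, url, keywords, body):
        """Return the earlier URL if this content is a duplicate, else store it."""
        key = (frozenset(keywords), self.feature_value(body))
        if key in self._seen:
            return self._seen[key]  # S307: duplicate found
        self._seen[key] = url       # S308: save for later duplicate checks
        return None
```

An exact hash only catches identical bodies; a production system would more likely use a similarity-preserving feature so that near-duplicates are detected as well.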
Step S309: The service platform generates the processed new sitemap data for the search engine to obtain.
In this step, the website may be configured to instruct the search engine to obtain the sitemap directly from the service platform, or the service platform may directly replace the website's original sitemap with the new one.
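Generating the new sitemap data in step S309 might look like the following sketch. It assumes the sitemap is an XML file in the sitemaps.org protocol format, which the description does not state explicitly; the function name `build_sitemap` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the sitemaps.org protocol (assumed format for this sketch).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Serialize the surviving URLs as a sitemap XML string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        # One <url><loc>...</loc></url> entry per retained link.
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

The resulting string could either be served from the service platform or written over the website's original sitemap file, matching the two delivery options described above.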
Step S310: The service platform monitors how the latest sitemap data is being indexed.
If a URL in the sitemap is indexed by the search engine, marker information is returned. By monitoring which URLs the search engine has indexed, the service platform provides a reference for subsequent adjustments to the sitemap.
Step S311: The service platform deletes the link from the sitemap and records the reason for the website to analyze.
In this step, the reason the link was deleted may be recorded in detail for the website to analyze.
It can be seen that the technical solution of this embodiment analyzes and filters the sitemap data obtained from the website and verifies the links in the sitemap by accessing them. It also extracts keywords and a text feature value from the body content and matches them against the pre-stored keywords and text feature values, which avoids submitting duplicate or low-quality content. Finally, it can monitor how the search engine indexes the sitemap. Through this processing, the present application can improve the quality of the sitemap, increase the number of website pages indexed by the search engine, help the search engine index the website's pages better, mitigate the search ranking demotion caused by submitting duplicate or spam content to the search engine, and allow the state of the website's content to be monitored more effectively.
The method for processing a sitemap according to the embodiments of the present application has been described in detail above. Correspondingly, the embodiments of the present application further provide an apparatus for processing a sitemap.
FIG. 4 is a schematic block diagram of an apparatus for processing a sitemap according to an embodiment of the present application.
As shown in FIG. 4, an apparatus for processing a sitemap includes an obtaining module 401, an access module 402, a first processing module 403, and a generating module 404. The apparatus for processing a sitemap in this embodiment may be a service platform or another device.
The obtaining module 401 is configured to obtain the sitemap of a website according to preset information.
After agreement has been reached with the website, the obtaining module 401 of the apparatus may obtain the website's sitemap according to the setting information provided by the website.
The access module 402 is configured to obtain the links to pages in the sitemap obtained by the obtaining module 401 and to access them.
The access module 402 obtains each URL link in the sitemap and accesses each one separately for verification.
The first processing module 403 is configured to delete, according to the access results of the access module 402, links in the sitemap that affect search indexing.
The first processing module 403 deletes links in the sitemap that affect search indexing according to the various access results.
The generating module 404 is configured to generate a new sitemap after the first processing module 403 has performed its processing.
FIG. 5 is another schematic block diagram of an apparatus for processing a sitemap according to an embodiment of the present application.
As shown in FIG. 5, an apparatus for processing a sitemap includes an obtaining module 401, an access module 402, a first processing module 403, and a generating module 404. The functions of these modules are as described with reference to FIG. 4.
In addition, the apparatus further includes a second processing module 405.
The second processing module 405 is configured to extract keywords and a text feature value from each accessed page and, according to the result of comparing them with the pre-stored keywords and text feature values, to delete links in the sitemap that affect search indexing. The generating module 404 is configured to generate the new sitemap after the first processing module 403 and the second processing module 405 have performed their processing.
When the comparison shows that the extracted keywords and text feature value are identical to pre-stored ones, the second processing module 405 judges that the content is a repeated submission and deletes the corresponding link.
The apparatus further includes an output module 406.
The output module 406 is configured to make the new sitemap generated by the generating module available for the search engine to access.
In the embodiments of the present application, the generated new sitemap may replace the website's original sitemap, so that the search engine accesses the new sitemap on the website; alternatively, the website may be configured so that the search engine accesses the new sitemap directly on the service platform. The embodiments place no limitation on this, as long as the search engine can access the new sitemap.
The apparatus further includes a monitoring module 407.
The monitoring module 407 is configured to record the indexing data that results from the search engine's searching and indexing after it accesses the new sitemap.
The first processing module 403 includes a first deleting unit 4031, a second deleting unit 4032, a third deleting unit 4033, or a fourth deleting unit 4034.
The first deleting unit 4031 is configured to delete the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed.
The second deleting unit 4032 is configured to delete the corresponding link when the access result is that the page response time is greater than or equal to a set threshold.
The third deleting unit 4033 is configured to delete the corresponding link when the access result is that the page's title, keywords, and description are incomplete.
The fourth deleting unit 4034 is configured to delete the corresponding link when the access result is that the page's body content does not match the page's title, keywords, and description.
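The four deleting units can be pictured as four checks applied to one access result. The sketch below is an assumption-laden illustration: the `AccessResult` fields, the 3-second default threshold, the order of the checks, and the reason strings are all hypothetical, since the description only says that a threshold is "set".

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AccessResult:
    """Outcome of visiting one sitemap link (hypothetical structure)."""
    status: int              # HTTP status code
    response_seconds: float  # measured page response time
    title: str
    keywords: str
    description: str
    body_matches_tkd: bool   # result of a separate body-vs-TKD check

def removal_reason(result: AccessResult, max_seconds: float = 3.0) -> Optional[str]:
    """Return why the link should be deleted, or None to keep it."""
    if result.status == 404:                     # first deleting unit
        return "HTTP 404"
    if result.response_seconds >= max_seconds:   # second deleting unit
        return "slow response"
    if not (result.title and result.keywords and result.description):
        return "incomplete TKD"                  # third deleting unit
    if not result.body_matches_tkd:              # fourth deleting unit
        return "body does not match TKD"
    return None
```

Returning the reason as a string rather than a bare boolean makes it straightforward to record why each link was deleted, as step S311 requires.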
Further, the embodiments of the present application also provide a processing device.
FIG. 6 is a schematic block diagram of a processing device 600 according to an embodiment of the present application.
As shown in FIG. 6, the processing device 600 provided by this embodiment includes a memory 601 and a processor 602.
The memory 601 is configured to store a program.
The processor 602 is configured to execute, by calling the program stored in the memory 601, a method including the following steps:
obtaining a sitemap of a website according to preset information;
obtaining the links to pages in the sitemap and accessing them;
deleting, according to the access results, links in the sitemap that affect search indexing;
generating a new sitemap.
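The four steps listed above can be sketched end to end as follows. The sketch assumes an XML sitemap in the sitemaps.org format and takes the link check as a caller-supplied function, so no network access is performed here; `process_sitemap` and `check_link` are hypothetical names, and obtaining the sitemap itself (the first step) is left to the caller.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SITEMAP_NS}
# Serialize the sitemap namespace as the default namespace (no prefix).
ET.register_namespace("", SITEMAP_NS)

def process_sitemap(sitemap_xml, check_link):
    """Filter a sitemap and return (new_sitemap_xml, removed_reasons).

    `check_link(url)` is assumed to access the link and return a reason
    string if it should be deleted, or None if it should be kept.
    """
    root = ET.fromstring(sitemap_xml)
    removed = {}
    # Iterate over a copy so entries can be removed from the tree safely.
    for url_el in list(root.findall("sm:url", NS)):
        loc = url_el.find("sm:loc", NS)
        url = (loc.text or "").strip() if loc is not None else ""
        reason = check_link(url)
        if reason is not None:
            root.remove(url_el)
            removed[url] = reason  # record why the link was deleted
    return ET.tostring(root, encoding="unicode"), removed
```

Passing the check as a function keeps the filtering logic independent of how pages are fetched, which also makes the pipeline easy to exercise with canned results.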
It should be noted that, for further details of the above method, reference may be made to the detailed description of the method flow above, which is not repeated here.
In this embodiment, the processor 602 runs the above program stored in the memory 601 to execute various functional applications and data processing, that is, to implement the method for processing a sitemap in the embodiments of the present application. The memory 601 may include, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like. After receiving an execution instruction, the processor 602 may execute the above program stored in the memory 601 and thereby implement the method defined by the flow disclosed in any of the foregoing embodiments of the present application.
The processor 602 may be an integrated circuit chip with signal processing capability. It may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor.
It will be understood that the structure shown in FIG. 6 is only illustrative. The processing device 600 may include more or fewer components than shown in FIG. 6, or may have a configuration different from that shown in FIG. 6. The components shown in FIG. 6 may be implemented in hardware, software, or a combination of the two.
The modules and units of the apparatus for processing a sitemap in the foregoing embodiments may be implemented in software code, in which case they may be stored in the memory 601 of the processing device 600. They may equally be implemented in hardware, for example as integrated circuit chips.
In summary, the technical solution of the embodiments of the present application analyzes and filters the sitemap data obtained from the website and verifies the links in the sitemap by accessing them. It also extracts keywords and a text feature value from the body content and matches them against the pre-stored keywords and text feature values, which avoids submitting duplicate or low-quality content. Finally, it can monitor how the search engine indexes the sitemap. Through this processing, the present application can improve the quality of the sitemap, increase the number of website pages indexed by the search engine, help the search engine index the website's pages better, address the search ranking demotion caused by submitting duplicate or spam content to the search engine, and allow the state of the website's content to be monitored more effectively.
The technical solution according to the present application has been described in detail above with reference to the accompanying drawings.
In addition, the above method according to the embodiments of the present application may also be implemented as a computer program that includes computer program code instructions which, when executed by a processor, cause the processor to perform the method provided by the above embodiments. Alternatively, the method may be implemented as a computer program product that includes a computer-readable medium carrying non-volatile, processor-executable program code which, when executed by the processor, performs the above method. Those skilled in the art will also appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with this disclosure may be implemented as electronic hardware, computer software, or a combination of both.
The flowcharts and block diagrams in the drawings show the architecture, functionality, and operation of possible implementations of systems and methods according to multiple embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of such blocks, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The embodiments of the present application have been described above. The foregoing description is illustrative, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

  1. A method for processing a sitemap, characterized by comprising:
    obtaining a sitemap of a website according to preset information;
    obtaining the links to pages in the sitemap and accessing them;
    deleting, according to the access results, links in the sitemap that affect search indexing;
    generating a new sitemap.
  2. The method according to claim 1, characterized in that, after the obtaining of the links to pages in the sitemap and the accessing of them, the method further comprises:
    extracting keywords and a text feature value from each accessed page;
    deleting links in the sitemap that affect search indexing according to the result of comparing the extracted keywords and text feature value with pre-stored keywords and text feature values.
  3. The method according to claim 2, characterized in that the deleting of links in the sitemap that affect search indexing according to the result of comparing the extracted keywords and text feature value with pre-stored keywords and text feature values comprises:
    when the comparison shows that the extracted keywords and text feature value are identical to the pre-stored keywords and text feature value, judging that the content is a repeated submission and deleting the corresponding link.
  4. The method according to claim 1, characterized in that the deleting, according to the access results, of links in the sitemap that affect search indexing comprises:
    deleting the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed; or,
    deleting the corresponding link when the access result is that the page response time is greater than or equal to a set threshold; or,
    deleting the corresponding link when the access result is that the page's title, keywords, and description are incomplete; or,
    deleting the corresponding link when the access result is that the page's body content does not match the page's title, keywords, and description.
  5. The method according to claim 1, characterized in that the method further comprises: making the generated new sitemap available for a search engine to access.
  6. The method according to claim 5, characterized in that the method further comprises:
    recording the indexing data that results from the search engine's searching and indexing after it accesses the new sitemap.
  7. An apparatus for processing a sitemap, characterized by comprising:
    an obtaining module, configured to obtain a sitemap of a website according to preset information;
    an access module, configured to obtain the links to pages in the sitemap obtained by the obtaining module and to access them;
    a first processing module, configured to delete, according to the access results of the access module, links in the sitemap that affect search indexing;
    a generating module, configured to generate a new sitemap after the first processing module has performed its processing.
  8. The apparatus according to claim 7, characterized in that the apparatus further comprises:
    a second processing module, configured to extract keywords and a text feature value from each accessed page and, according to the result of comparing the extracted keywords and text feature value with pre-stored keywords and text feature values, to delete links in the sitemap that affect search indexing;
    wherein the generating module is configured to generate the new sitemap after the first processing module and the second processing module have performed their processing.
  9. The apparatus according to claim 7, characterized in that the apparatus further comprises:
    an output module, configured to make the new sitemap generated by the generating module available for a search engine to access.
  10. The apparatus according to claim 9, characterized in that the apparatus further comprises:
    a monitoring module, configured to record the indexing data that results from the search engine's searching and indexing after it accesses the new sitemap.
  11. The apparatus according to claim 7, characterized in that the first processing module comprises:
    a first deleting unit, configured to delete the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed; or,
    a second deleting unit, configured to delete the corresponding link when the access result is that the page response time is greater than or equal to a set threshold; or,
    a third deleting unit, configured to delete the corresponding link when the access result is that the page's title, keywords, and description are incomplete; or,
    a fourth deleting unit, configured to delete the corresponding link when the access result is that the page's body content does not match the page's title, keywords, and description.
  12. A processing device, characterized by comprising:
    a memory, configured to store a program;
    a processor, configured to execute, by calling the program stored in the memory, a method comprising the following steps:
    obtaining a sitemap of a website according to preset information;
    obtaining the links to pages in the sitemap and accessing them;
    deleting, according to the access results, links in the sitemap that affect search indexing;
    generating a new sitemap.
PCT/CN2016/102215 2015-10-16 2016-10-14 Method, apparatus and device for processing sitemap WO2017063596A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510676894.0 2015-10-16
CN201510676894.0A CN105260469B (en) 2015-10-16 2015-10-16 A kind of method, apparatus and equipment for handling site maps

Publications (1)

Publication Number Publication Date
WO2017063596A1 true WO2017063596A1 (en) 2017-04-20

Family

ID=55100159

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/102215 WO2017063596A1 (en) 2015-10-16 2016-10-14 Method, apparatus and device for processing sitemap

Country Status (2)

Country Link
CN (1) CN105260469B (en)
WO (1) WO2017063596A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695056A (en) * 2019-03-12 2020-09-22 阿里巴巴集团控股有限公司 Page processing method, page return processing method, device and equipment

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105260469B (en) * 2015-10-16 2017-12-26 广州神马移动信息科技有限公司 A kind of method, apparatus and equipment for handling site maps
CN106095674B (en) * 2016-06-07 2019-05-24 百度在线网络技术(北京)有限公司 A kind of website automation test method and device
CN107807937B (en) * 2016-09-09 2021-11-30 阿里巴巴集团控股有限公司 Website SEO processing method, device and system
CN108255831B (en) * 2016-12-28 2021-12-17 航天信息股份有限公司 Method and system for generating website map for website
CN112307395A (en) * 2020-08-10 2021-02-02 北京沃东天骏信息技术有限公司 Method and device for generating website map

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1486457A (en) * 2000-11-21 2004-03-31 ��ķɭ��ɹ�˾ System and process for mediated crawling
CN102057372A (en) * 2008-04-17 2011-05-11 谷歌公司 Generating sitemaps
US8290928B1 (en) * 2008-02-21 2012-10-16 Google Inc. Generating sitemap where last modified time is not available to a network crawler
US20130226898A1 (en) * 2005-05-31 2013-08-29 Google Inc. Web Crawler Scheduler that Utilizes Sitemaps from Websites
CN104317938A (en) * 2014-10-31 2015-01-28 北京国双科技有限公司 Webpage validation method and device
CN105260469A (en) * 2015-10-16 2016-01-20 广州神马移动信息科技有限公司 Sitemap processing method, apparatus and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8126869B2 (en) * 2008-02-08 2012-02-28 Microsoft Corporation Automated client sitemap generation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1486457A (en) * 2000-11-21 2004-03-31 ��ķɭ��ɹ�˾ System and process for mediated crawling
US20130226898A1 (en) * 2005-05-31 2013-08-29 Google Inc. Web Crawler Scheduler that Utilizes Sitemaps from Websites
US8290928B1 (en) * 2008-02-21 2012-10-16 Google Inc. Generating sitemap where last modified time is not available to a network crawler
CN102057372A (en) * 2008-04-17 2011-05-11 谷歌公司 Generating sitemaps
CN104317938A (en) * 2014-10-31 2015-01-28 北京国双科技有限公司 Webpage validation method and device
CN105260469A (en) * 2015-10-16 2016-01-20 广州神马移动信息科技有限公司 Sitemap processing method, apparatus and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695056A (en) * 2019-03-12 2020-09-22 阿里巴巴集团控股有限公司 Page processing method, page return processing method, device and equipment
CN111695056B (en) * 2019-03-12 2024-03-22 阿里巴巴集团控股有限公司 Page processing and page return processing methods, devices and equipment

Also Published As

Publication number Publication date
CN105260469A (en) 2016-01-20
CN105260469B (en) 2017-12-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16854972

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16854972

Country of ref document: EP

Kind code of ref document: A1