CN105260469B - Method, apparatus and device for processing a sitemap - Google Patents

Method, apparatus and device for processing a sitemap

Info

Publication number
CN105260469B
Authority
CN
China
Prior art keywords
site maps
website
link
keyword
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510676894.0A
Other languages
Chinese (zh)
Other versions
CN105260469A (en
Inventor
梁捷
梁卡喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Guangzhou Shenma Mobile Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shenma Mobile Information Technology Co Ltd filed Critical Guangzhou Shenma Mobile Information Technology Co Ltd
Priority to CN201510676894.0A priority Critical patent/CN105260469B/en
Publication of CN105260469A publication Critical patent/CN105260469A/en
Priority to PCT/CN2016/102215 priority patent/WO2017063596A1/en
Application granted granted Critical
Publication of CN105260469B publication Critical patent/CN105260469B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Remote Sensing (AREA)

Abstract

The present invention discloses a method, apparatus and device for processing a sitemap. The method includes: obtaining the sitemap of a website according to preset information; obtaining the links to the pages in the sitemap and accessing them; deleting from the sitemap, according to the access results, the links that adversely affect search indexing; and generating a new sitemap. The technical solution provided by the invention can improve sitemap quality and increase the likelihood that the site's pages are indexed by search engines, meeting the respective needs of the website and the search engine.

Description

Method, apparatus and device for processing a sitemap
Technical field
The present invention relates to the field of mobile Internet technology, and in particular to a method, apparatus and device for processing a sitemap.
Background
At present, search engines usually discover web pages through links within a website and on other websites. A sitemap makes it easy for a website to tell search engines which of its pages are available for crawling. In its simplest form, a sitemap is an XML (Extensible Markup Language) file that lists the URLs of a website together with metadata about each URL (when it was last updated, how often it changes, and how important it is relative to the other URLs on the site), so that search engines can crawl the site more intelligently. Put simply, a sitemap can be understood as a list of the links on a website. Generating a sitemap and submitting it to search engines makes the site's content easier to index, including pages buried deep in the site, and is a good way for a website and a search engine to communicate.
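As an illustration of the XML form described above, the following minimal sketch reads a small sitemap and collects each URL with its optional metadata. The sample document, the `example.com` addresses and the function name `parse_sitemap` are assumptions made for illustration; the element names and namespace follow the public sitemap protocol rather than anything specified in this patent.

```python
import xml.etree.ElementTree as ET

# Namespace used by the public sitemap protocol (sitemaps.org).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample sitemap used only for this illustration.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2015-10-01</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry, holding the fields that are present."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        entry = {}
        for name in ("loc", "lastmod", "changefreq", "priority"):
            node = url.find("sm:" + name, SITEMAP_NS)
            if node is not None and node.text:
                entry[name] = node.text.strip()
        if "loc" in entry:  # an entry without an address is not usable
            entries.append(entry)
    return entries


if __name__ == "__main__":
    for item in parse_sitemap(SAMPLE):
        print(item)
```

Only `loc` is required for an entry to be kept; the other three fields are treated as optional metadata.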
However, the links in the sitemaps that many websites currently provide can have a number of quality problems: a link may be broken, or the linked content may be of poor quality or out of date. All of these waste the search engine's crawling resources. As a result, even though a website provides a sitemap, the search engine, judging by what it crawls, will not necessarily index the links in that sitemap. Such problems may also trigger the search engine's demotion rules, reducing the number of the site's links that are indexed and lowering the site's search ranking.
Therefore, existing sitemap processing methods cannot meet the respective needs of websites and search engines.
Summary of the invention
To solve the above technical problem, the present invention provides a method, apparatus and device for processing a sitemap that can meet the respective needs of websites and search engines.
According to one aspect of the present invention, a method for processing a sitemap is provided, including:
obtaining the sitemap of a website according to preset information;
obtaining the links to the pages in the sitemap and accessing them;
deleting from the sitemap, according to the access results, the links that adversely affect search indexing;
generating a new sitemap.
Preferably, after obtaining the links to the pages in the sitemap and accessing them, the method further includes:
extracting keywords and a body-text feature value from each accessed page;
deleting from the sitemap the links that adversely affect search indexing, according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values.
Preferably, deleting from the sitemap, according to the access results, the links that adversely affect search indexing includes:
when the access result is an HTTP 404 error indicating that the page cannot be accessed, deleting the corresponding link; or,
when the access result is that the page response time is greater than or equal to a set threshold, deleting the corresponding link; or,
when the access result is that the title, keywords and description of the page are incomplete, deleting the corresponding link; or,
when the access result is that the body content of the page does not match the title, keywords and description of the page, deleting the corresponding link.
Preferably, deleting from the sitemap the links that adversely affect search indexing, according to the result of comparing the extracted keywords and body-text feature value with the pre-stored keywords and body-text feature values, includes:
when the extracted keywords and body-text feature value are identical to pre-stored keywords and a pre-stored body-text feature value, determining that the content is a duplicate submission and deleting the corresponding link.
Preferably, the method further includes:
providing the generated new sitemap to a search engine for access.
Preferably, the method further includes:
recording the indexing data produced when the search engine crawls and indexes pages after accessing the new sitemap.
According to another aspect of the present invention, an apparatus for processing a sitemap is provided, including:
an acquisition module, configured to obtain the sitemap of a website according to preset information;
an access module, configured to obtain, from the sitemap obtained by the acquisition module, the links to the pages in the sitemap and access them;
a first processing module, configured to delete from the sitemap, according to the access results of the access module, the links that adversely affect search indexing;
a generation module, configured to generate a new sitemap after processing by the first processing module.
Preferably, the apparatus further includes:
a second processing module, configured to extract keywords and a body-text feature value from each accessed page, and to delete from the sitemap the links that adversely affect search indexing according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values;
the generation module generates the new sitemap after processing by the first processing module and the second processing module.
Preferably, the apparatus further includes:
an output module, configured to provide the new sitemap generated by the generation module to a search engine for access.
Preferably, the apparatus further includes:
a monitoring module, configured to record the indexing data produced when the search engine crawls and indexes pages after accessing the new sitemap.
Preferably, the first processing module includes:
a first deletion unit, configured to delete the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed; or,
a second deletion unit, configured to delete the corresponding link when the access result is that the page response time is greater than or equal to a set threshold; or,
a third deletion unit, configured to delete the corresponding link when the access result is that the title, keywords and description of the page are incomplete; or,
a fourth deletion unit, configured to delete the corresponding link when the access result is that the body content of the page does not match the title, keywords and description of the page.
According to another aspect of the present invention, a processing device is provided, including:
a memory, configured to store a program;
a processor, configured to execute the following program stored in the memory:
obtaining the sitemap of a website according to preset information;
obtaining the links to the pages in the sitemap and accessing them;
deleting from the sitemap, according to the access results, the links that adversely affect search indexing;
generating a new sitemap.
It can be seen that, in the technical solution of the embodiments of the present invention, the links to the pages in the sitemap are first obtained and accessed; when the access results reveal links that adversely affect search indexing, those links are deleted from the sitemap and a new sitemap is generated. The website's original sitemap is thereby optimized, and links to poor content or error-prone pages are kept out of the sitemap as far as possible. This improves sitemap quality, increases the likelihood of being indexed by search engines, and meets the needs of both the website and the search engine.
Brief description of the drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the disclosure, taken in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.
Fig. 1 is a schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention;
Fig. 2 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention;
Fig. 3 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention;
Fig. 4 is a schematic block diagram of an apparatus for processing a sitemap according to the present invention;
Fig. 5 is another schematic block diagram of an apparatus for processing a sitemap according to the present invention;
Fig. 6 is a schematic block diagram of a processing device according to the present invention.
Detailed description
Preferred embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey its scope to those skilled in the art.
The present invention provides a method for processing a sitemap that can meet the respective needs of websites and search engines.
Fig. 1 is a schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention.
As shown in Fig. 1, the method includes:
Step 101: obtain the sitemap of the website according to preset information.
In this step, after an agreement has been reached with the website, the sitemap of the website can be obtained according to the configuration information provided by the website.
Step 102: obtain the links to the pages in the sitemap and access them.
In this step, each URL (Uniform Resource Locator) link in the sitemap is obtained, and each URL link is accessed in turn for verification.
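The access performed in step 102 could be sketched as follows. This is a minimal illustration, not the patented implementation: the `AccessResult` record, the `fetch` and `visit_all` functions and the 10-second timeout are all assumptions. Each visit records the HTTP status, the elapsed time and the page body, which are the inputs the later checks rely on.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class AccessResult:
    """Outcome of visiting one sitemap link (hypothetical record type)."""
    url: str
    status: int      # HTTP status code, or 0 when no response was received
    elapsed: float   # seconds spent on the request
    body: str        # decoded page body, empty on failure


def fetch(url, timeout=10.0):
    """Visit one URL and record status, timing and body."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            status = resp.status
    except urllib.error.HTTPError as err:
        # The server answered with an error status such as 404.
        status, body = err.code, ""
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar problems.
        status, body = 0, ""
    return AccessResult(url, status, time.monotonic() - start, body)


def visit_all(urls, fetcher=fetch):
    """Visit every link in turn; the fetcher is injectable for testing."""
    return [fetcher(u) for u in urls]
```

Passing the fetcher as a parameter keeps the loop testable without network access, for example by substituting a function that returns canned results.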
Step 103: delete from the sitemap, according to the access results, the links that adversely affect search indexing.
In this step, deleting from the sitemap, according to the access results, the links that adversely affect search indexing includes:
when the access result is an HTTP 404 error indicating that the page cannot be accessed, deleting the corresponding link; or,
when the access result is that the page response time is greater than or equal to a set threshold, deleting the corresponding link; or,
when the access result is that the title, keywords and description of the page are incomplete, deleting the corresponding link; or,
when the access result is that the body content of the page does not match the title, keywords and description of the page, deleting the corresponding link.
Step 104: generate a new sitemap.
In this step, after every link that adversely affects search indexing has been deleted from the sitemap, the remaining links are rearranged and a new sitemap is generated.
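Generating the new sitemap in step 104 might look like the sketch below, which writes out only the entries whose links were not deleted. The function name `build_sitemap`, the dict-based entry format and the use of the public sitemap namespace are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries, removed_urls):
    """Serialize the entries that survived filtering as sitemap XML.

    entries: list of dicts with a "loc" key and optional metadata keys.
    removed_urls: set of addresses that the earlier checks rejected.
    """
    # Emit the sitemap namespace as the default namespace, without a prefix.
    ET.register_namespace("", NS)
    urlset = ET.Element("{%s}urlset" % NS)
    for entry in entries:
        if entry["loc"] in removed_urls:
            continue  # the link was deleted, so it is left out
        url = ET.SubElement(urlset, "{%s}url" % NS)
        for name in ("loc", "lastmod", "changefreq", "priority"):
            if name in entry:
                ET.SubElement(url, "{%s}%s" % (NS, name)).text = entry[name]
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    pages = [
        {"loc": "https://example.com/a"},
        {"loc": "https://example.com/b", "priority": "0.5"},
    ]
    print(build_sitemap(pages, {"https://example.com/a"}))
```

The original metadata of each surviving entry is carried over unchanged; only the rejected links disappear.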
It can be seen that, in the technical solution of the embodiments of the present invention, the links to the pages in the sitemap are first obtained and accessed; when the access results reveal links that adversely affect search indexing, those links are deleted from the sitemap and a new sitemap is generated. The website's original sitemap is thereby optimized, and links to poor content or error-prone pages are kept out of the sitemap as far as possible. This improves sitemap quality, increases the likelihood of being indexed by search engines, and meets the needs of both the website and the search engine.
The technical solution is described in more detail below.
Fig. 2 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention.
As shown in Fig. 2, the method includes:
Step 201: obtain the sitemap of the website according to preset information.
For this step, see the description of step 101 above.
Step 202: obtain the links to the pages in the sitemap and access them.
For this step, see the description of step 102 above.
Step 203: delete from the sitemap, according to the access results, the links that adversely affect search indexing.
For this step, see the description of step 103 above.
Step 204: extract keywords and a body-text feature value from each accessed page.
In this step, keywords can be extracted from the page content, and a body-text feature value can be extracted from the body content, using any of various existing algorithms; the invention is not limited in this respect.
Step 205: delete from the sitemap the links that adversely affect search indexing, according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values.
In this step, when the extracted keywords and body-text feature value are identical to pre-stored keywords and a pre-stored body-text feature value, the content is determined to be a duplicate submission and the corresponding link is deleted.
Step 206: generate a new sitemap.
Step 207: provide the generated new sitemap to a search engine for access.
In this step, the generated new sitemap may replace the website's original sitemap, so that the search engine accesses the new sitemap at the website; alternatively, the website may be configured so that the search engine accesses the new sitemap directly at the service platform. The invention is not limited in this respect, as long as the search engine is able to access the new sitemap.
It should be noted that there is no required order between the processing of steps 202 and 203 and the processing of steps 204 and 205; the steps are arranged as above only for convenience of description.
It should also be noted that, after step 207, the method may further include: recording the indexing data produced when the search engine crawls and indexes pages after accessing the new sitemap.
It can be seen that the technical solution of the embodiments of the present invention can delete links that adversely affect search indexing both according to the access results and according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values, which improves the optimization effect. In addition, the indexing data produced after the search engine accesses the new sitemap can be recorded, providing a reference for later modification of the sitemap or for analysis by the website.
Fig. 3 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention.
As shown in Fig. 3, the method includes:
Step 301: the sitemap service platform extracts data from the website's sitemap according to the configuration information of the website.
In this step, the website reaches an agreement in advance with the sitemap service platform (hereinafter "the service platform"). The website sets up a mapping between its sitemap and the service platform, allowing the service platform to process the sitemap according to configuration information, such as address information, provided by the website. The website may set up the mapping in XML. Using the configuration information provided by the website, the service platform can extract data from the sitemap and obtain the URL of each link in it.
Step 302: the service platform checks each URL extracted from the sitemap and determines whether accessing the URL produces an HTTP 404 error indicating that the page cannot be accessed. If so, the method proceeds to step 311, where the URL is deleted from the sitemap and the reason is recorded; if not, the method proceeds to step 303.
An HTTP 404 error means that the page the link points to does not exist, that is, the URL of the original page is no longer valid. This happens frequently, for example when the rules for generating page URLs change, when a page file is renamed or moved, or when an inbound link is misspelled, so that the original URL can no longer be accessed. When the web server receives such a request, it returns a 404 status code, telling the browser that the requested resource does not exist. Therefore, an HTTP 404 error for a URL indicates that the URL is no longer valid; the URL is then deleted from the sitemap and the reason is recorded.
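The 404 check of step 302 can be illustrated with a lightweight status probe. This sketch assumes that a HEAD request is sufficient to learn the status code, which not every server supports; the function names `http_status` and `is_dead_link` are hypothetical.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx and 5xx answers; the code is kept.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def is_dead_link(status):
    """Step 302 rule: only a 404 answer marks the link as no longer valid."""
    return status == 404
```

A missing response (`None`) is deliberately not treated as a dead link here, since the text singles out the 404 status; whether to retry or delete in that case is left open.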
Step 303: the service platform determines whether the page response speed for the URL is abnormal. If so, the method proceeds to step 311, where the URL is deleted from the sitemap and the reason is recorded; if not, the method proceeds to step 304.
When the URL can be accessed normally, the response speed of the page is checked. Response speed can be measured by response time: if the response time is greater than or equal to a set threshold, the response speed is considered abnormal; if it is below the threshold, the response speed is considered normal. The threshold can be chosen empirically, for example 500 milliseconds or 1 second; the invention is not limited in this respect.
It should be noted that the current response speed can also be compared with the page's historical response speed to determine whether it is abnormal. If the current response time exceeds the historical response time by more than a certain threshold, the response speed is considered abnormal.
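Both response-time tests can be sketched in a few lines. The 1-second default comes from the example values in the text; the use of the median as the historical baseline and the factor of 3 are assumptions chosen only for illustration.

```python
from statistics import median


def is_slow(elapsed, threshold=1.0):
    """Fixed-threshold test: abnormal when elapsed >= threshold (seconds)."""
    return elapsed >= threshold


def is_slow_vs_history(elapsed, history, factor=3.0):
    """Relative test against earlier measurements of the same page.

    history: earlier response times in seconds. Without any history the
    page is not flagged, because there is nothing to compare against.
    """
    if not history:
        return False
    return elapsed > factor * median(history)


if __name__ == "__main__":
    print(is_slow(0.2), is_slow(1.0))
    print(is_slow_vs_history(0.9, [0.2, 0.25, 0.3]))
```

The relative test can flag a page that still answers within the fixed threshold but has become several times slower than usual.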
Therefore, an abnormal page response speed indicates that there may be a problem with the page corresponding to the URL or with the network connection to it, either of which affects the user's browsing experience; the URL is then deleted from the sitemap and the reason is recorded.
Step 304: the service platform determines whether the TKD of the page is incomplete. If so, the method proceeds to step 311, where the URL is deleted from the sitemap and the reason is recorded; if not, the method proceeds to step 305.
TKD is short for title, keywords and description. The TKD of a page can take the following form:
<title>Title content goes here</title>
<meta name="keywords" content="Keyword content goes here"/>
<meta name="description" content="Description content goes here"/>
Keywords are words that a webmaster sets for a page so that users can find the page through a search engine; the keywords represent the market positioning of the website. The description, also called the "content tag", "description tag" or "content summary", reflects the main content of the page.
In general, only a complete TKD satisfies a search engine's search rules. If the TKD is incomplete and does not satisfy those rules, the search engine may not search the page or may not index the linked content. Therefore, when the TKD is found to be incomplete, the URL is deleted from the sitemap and the reason is recorded.
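A TKD completeness check along the lines of step 304 could be sketched with the standard-library HTML parser. The class and function names are hypothetical, and "complete" is assumed here to mean that all three fields are present and non-empty, which is one reasonable reading of the text.

```python
from html.parser import HTMLParser


class TKDParser(HTMLParser):
    """Collects the title text and the keywords/description meta content."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attributes.get("name") or "").lower()
            content = (attributes.get("content") or "").strip()
            if name == "keywords":
                self.keywords = content
            elif name == "description":
                self.description = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_tkd(html):
    """Return the title, keywords and description found in an HTML page."""
    parser = TKDParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "keywords": parser.keywords,
        "description": parser.description,
    }


def tkd_is_complete(tkd):
    """Assumed rule: every one of the three fields must be non-empty."""
    return all(tkd[key] for key in ("title", "keywords", "description"))
```

A page with only a `<title>` element would therefore fail the check and be a candidate for deletion from the sitemap.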
Step 305: the service platform determines whether the body content of the page fails to match the TKD. If so, the method proceeds to step 311, where the URL is deleted from the sitemap and the reason is recorded; if not, the method proceeds to step 306.
In this step, the body content of the page is examined to determine whether the keywords in the TKD appear in the body and whether the body content corresponds to the title and description in the TKD. If the keywords in the TKD appear in the body and the body content corresponds to the title and description, the page is considered a match; otherwise it is a mismatch. A mismatch may mean that either the body or the TKD has been set incorrectly, and either case affects the search engine's search quality and the user's browsing experience. Therefore, when the body content of the page is found not to match the TKD, the URL is deleted from the sitemap and the reason is recorded.
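The text does not say how "corresponds to" is measured, so the following sketch substitutes a simple word-overlap test and should be read as one possible stand-in rather than the method of the patent. The 0.5 overlap threshold, the comma-separated keyword format and all names are assumptions; the whitespace-based tokenization also would not suit languages written without spaces, such as Chinese.

```python
import re


def _tokens(text):
    """Lower-cased set of word tokens in text."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_ratio(text, body_tokens):
    """Share of the words in text that also occur in the body."""
    tokens = _tokens(text)
    if not tokens:
        return 0.0
    return len(tokens & body_tokens) / len(tokens)


def tkd_matches_body(tkd, body, min_overlap=0.5):
    """Assumed rule: all keywords occur in the body, and at least
    min_overlap of the title words and of the description words do too."""
    body_lower = body.lower()
    body_tokens = _tokens(body)
    keywords = [k.strip().lower() for k in tkd["keywords"].split(",") if k.strip()]
    keywords_present = bool(keywords) and all(k in body_lower for k in keywords)
    title_ok = overlap_ratio(tkd["title"], body_tokens) >= min_overlap
    description_ok = overlap_ratio(tkd["description"], body_tokens) >= min_overlap
    return keywords_present and title_ok and description_ok
```

A production system would likely use stemming or a semantic similarity measure instead of exact word overlap.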
Step 306: the service platform extracts keywords from the page content and extracts a body-text feature value from the body content.
In this step, the service platform can extract keywords from the page content, and a body-text feature value from the body content, using any of various existing algorithms; the invention is not limited in this respect.
For example, keyword extraction can use the existing TF-IDF (term frequency-inverse document frequency) algorithm. The algorithm stores information about every word in a dictionary, sorts the dictionary by value, and finally takes the several words with the highest weights as the keywords. The body-text feature value can be extracted, for example, by a text feature extraction method based on a context framework or by one based on an ontology.
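The TF-IDF procedure just described (score every word in a dictionary, sort by value, keep the top few) can be sketched in plain Python. The smoothed IDF formula, the tokenizer limited to Latin letters and digits, and the function names are assumptions; the patent does not fix these details.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-cased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_keywords(doc, corpus, k=3):
    """Return the k highest-weighted words of doc by TF-IDF.

    corpus: list of documents (including doc) used for document frequency.
    """
    term_counts = Counter(tokenize(doc))
    total_terms = sum(term_counts.values())
    doc_sets = [set(tokenize(d)) for d in corpus]
    n_docs = len(corpus)

    scores = {}  # the "dictionary" holding one weight per word
    for word, count in term_counts.items():
        doc_freq = sum(1 for s in doc_sets if word in s)
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0  # smoothed IDF
        scores[word] = (count / total_terms) * idf

    # Sort by descending weight; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "the sitemap lists the sitemap urls",
    ]
    print(top_keywords(docs[2], docs, k=2))
```

Words that occur in every document, such as "the" in the sample corpus, receive the lowest IDF and therefore rank below words specific to the page.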
Step 307: the service platform compares the extracted keywords and body-text feature value with the keywords and body-text feature values stored on the service platform to check whether the content is a duplicate submission. If so, the method proceeds to step 311, where the URL is deleted from the sitemap and the reason is recorded; if not, the method proceeds to step 308.
In this step, the extracted keywords and body-text feature value are compared with those stored on the service platform in order to match text similarity. If identical keywords and an identical body-text feature value are found on the service platform, the content is determined to be a duplicate. This matching check reveals whether content has been submitted more than once. The service platform stores in advance the keywords and body-text feature value of every page article it has checked.
Step 308: the service platform saves the extracted keywords, the body-text feature value and the corresponding link for use in later duplicate checks.
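Steps 307 and 308 together amount to a lookup-then-store operation, sketched below. Because the text leaves the body-text feature value unspecified, a SHA-256 hash of the whitespace-normalized, lower-cased body is used here purely as a stand-in; it detects only exact duplicates. The `SeenStore` class and its in-memory dictionary are likewise assumptions.

```python
import hashlib
import re


def body_fingerprint(body):
    """Stand-in feature value: hash of the normalized body text."""
    normalized = re.sub(r"\s+", " ", body.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SeenStore:
    """Remembers the keywords and feature value of every page checked."""

    def __init__(self):
        self._seen = {}  # (keywords, fingerprint) -> first URL seen

    def check_and_add(self, url, keywords, body):
        """Return the earlier URL if this page duplicates it, else None.

        A page that is not a duplicate is stored for later checks
        (step 308); a duplicate is reported and not stored again.
        """
        key = (tuple(sorted(keywords)), body_fingerprint(body))
        first_url = self._seen.get(key)
        if first_url is not None and first_url != url:
            return first_url
        self._seen[key] = url
        return None
```

Sorting the keywords before building the key makes the comparison independent of the order in which they were extracted.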
Step 309: the service platform generates the new, processed sitemap data for the search engine to obtain.
In this step, the website can be configured to direct the search engine to obtain the sitemap from the service platform; alternatively, the service platform can replace the website's original sitemap with the new sitemap directly.
Step 310: the service platform monitors the indexing status of the latest sitemap data.
If a URL in the sitemap is indexed by the search engine, marker information is returned. The service platform monitors which URLs have been indexed by the search engine, which can serve as a reference for later adjustment of the sitemap.
Step 311: the service platform deletes the link from the sitemap and records the reason for analysis by the website.
In this step, the reason each link was deleted can be recorded in detail for analysis by the website.
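The chain of checks in steps 302 to 307, with every failure routed to step 311, can be sketched as an ordered list of rules in which the first failing rule supplies the recorded reason. The `PageCheck` record, the reason strings and the 1-second limit are assumptions; the per-page flags are taken as already computed by the individual checks.

```python
from dataclasses import dataclass


@dataclass
class PageCheck:
    """Per-page results of the individual checks (hypothetical record)."""
    url: str
    status: int
    elapsed: float
    tkd_complete: bool
    tkd_matches_body: bool
    is_duplicate: bool


# Rules in the order of steps 302, 303, 304, 305 and 307.
# Each rule pairs a recorded reason with a predicate that is True on failure.
RULES = [
    ("HTTP 404", lambda p: p.status == 404),
    ("slow response", lambda p: p.elapsed >= 1.0),
    ("incomplete TKD", lambda p: not p.tkd_complete),
    ("TKD does not match body", lambda p: not p.tkd_matches_body),
    ("duplicate content", lambda p: p.is_duplicate),
]


def filter_pages(pages):
    """Return (kept URLs, {removed URL: reason}) for a list of PageCheck."""
    kept, removed = [], {}
    for page in pages:
        reason = next((name for name, failed in RULES if failed(page)), None)
        if reason is None:
            kept.append(page.url)
        else:
            removed[page.url] = reason  # step 311: delete and record why
    return kept, removed
```

The `removed` mapping is the kind of record a website could later analyze to find out why particular links were dropped.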
It can be seen that the technical solution of the embodiments of the present invention analyzes and filters the sitemap data obtained from the website and verifies the links provided in the sitemap by accessing them. It also extracts keywords and a body-text feature value from the body content and matches them against pre-stored keywords and body-text feature values, which avoids submitting duplicate or low-quality content. Finally, it can monitor how the search engine indexes the sitemap. Through this processing, the present invention can optimize sitemap quality, increase the amount of the website's content that is indexed by search engines, and allow search engines to index the website's pages more effectively. It also solves the problem of search demotion caused by submitting duplicate or junk content to search engines, and allows the state of the website's content to be monitored more effectively.
The method for processing a sitemap according to the present invention has been described in detail above. Correspondingly, the present invention also provides an apparatus for processing a sitemap.
Fig. 4 is a schematic block diagram of an apparatus for processing a sitemap according to the present invention.
As shown in Fig. 4, an apparatus for processing a sitemap includes: an acquisition module 401, an access module 402, a first processing module 403 and a generation module 404. The apparatus for processing a sitemap according to the present invention may be a service platform or another device.
The acquisition module 401 is configured to obtain the sitemap of a website according to preset information.
After an agreement has been reached with the website, the apparatus can obtain the sitemap of the website through the acquisition module 401 according to the configuration information provided by the website.
The access module 402 is configured to obtain, from the sitemap obtained by the acquisition module 401, the links to the pages in the sitemap and access them.
The access module 402 obtains each URL link in the sitemap and accesses each URL link in turn for verification.
The first processing module 403 is configured to delete from the sitemap, according to the access results of the access module 402, the links that adversely affect search indexing.
The first processing module 403 deletes the links that adversely affect search indexing according to the various access results.
The generation module 404 is configured to generate a new sitemap after processing by the first processing module 403.
Fig. 5 is another schematic block diagram of an apparatus for processing a sitemap according to the present invention.
As shown in Fig. 5, an apparatus for processing a sitemap includes: an acquisition module 401, an access module 402, a first processing module 403 and a generation module 404. For the function of each module, see the description of Fig. 4.
In addition, the apparatus further includes a second processing module 405.
The second processing module 405 is configured to extract keywords and a body-text feature value from each accessed page, and to delete from the sitemap the links that adversely affect search indexing according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values. The generation module 404 generates the new sitemap after processing by the first processing module 403 and the second processing module 405.
When the extracted keywords and body-text feature value are identical to pre-stored keywords and a pre-stored body-text feature value, the second processing module 405 determines that the content is a duplicate submission and deletes the corresponding link.
The apparatus further includes an output module 406.
The output module 406 is configured to provide the new sitemap generated by the generation module to a search engine for access.
In the present invention, the generated new sitemap may replace the website's original sitemap, so that the search engine accesses the new sitemap at the website; alternatively, the website may be configured so that the search engine accesses the new sitemap directly at the service platform. The invention is not limited in this respect, as long as the search engine is able to access the new sitemap.
The apparatus further includes a monitoring module 407.
The monitoring module 407 is configured to record the indexing data produced when the search engine crawls and indexes pages after accessing the new sitemap.
The first processing module 403 includes: a first deletion unit 4031, a second deletion unit 4032, a third deletion unit 4033 or a fourth deletion unit 4034.
The first deletion unit 4031 is configured to delete the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed.
The second deletion unit 4032 is configured to delete the corresponding link when the access result is that the page response time is greater than or equal to a set threshold.
The third deletion unit 4033 is configured to delete the corresponding link when the access result is that the title, keywords and description of the page are incomplete.
The fourth deletion unit 4034 is configured to delete the corresponding link when the access result is that the body content of the page does not match the title, keywords and description of the page.
The present invention also provides a kind of processing equipment.
Fig. 6 is a kind of schematic block diagram of processing equipment of the present invention.
As shown in fig. 6, processing equipment includes:Memory 601 and processor 602.
Memory 601, for storage program,
Processor 602, the following procedure stored for performing the memory 601:
The site maps of website are obtained according to presupposed information;
Obtain the link of the page in site maps and conduct interviews;
Influence to search for the link included in site maps according to accessing result and deleting;
Generate new site maps.
It should be noted that other programs that memory 601 stores, referring specifically to the description in previous methods flow, herein Repeat no more, processor 602 is additionally operable to perform other programs that memory 601 stores.
In summary, the technical scheme of the embodiment of the present invention, the sitemap data of the website of acquisition analyzed Filter, conduct interviews checking to the sitemap links provided, also carries out keyword extraction and text characteristic value to body matter in addition Extraction, and the keyword with prestoring and text characteristic value are matched, so as to avoid submitting duplicate contents or poor quality Content.Search engine can also be finally monitored to sitemap collection situation.By above-mentioned processing, the present invention can be with Optimize sitemap quality, what the searched engine of lifting web site contents was included includes quantity, allows search engine preferably to include net The page stood, also solve the problems, such as that duplicate contents, rubbish contents search for drop power caused by being submitted to search engine, can also be more The situation of good monitoring web site contents.
The technical solution according to the present invention has been described in detail above with reference to the accompanying drawings.
In addition, the method according to the present invention may also be implemented as a computer program that includes computer program code instructions for performing the steps defined in the above method of the present invention. Alternatively, the method according to the present invention may be implemented as a computer program product that includes a computer-readable medium on which is stored a computer program for performing the functions defined in the above method of the present invention. Those skilled in the art will also appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.
The flowcharts and block diagrams in the accompanying drawings show the architecture, functions, and operations of possible implementations of systems and methods according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
Various embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method for a service platform to process a sitemap, wherein the service platform negotiates with a website in advance, and the website sets a mapping relationship between the sitemap and the service platform so as to allow the service platform to process the sitemap of the website according to configuration information provided by the website, the method comprising:
obtaining the sitemap of the website according to preset information;
obtaining the links of the pages in the sitemap and accessing them;
deleting, according to the access results, the links in the sitemap that affect search indexing;
generating a new sitemap and replacing the original sitemap on the website with the new sitemap, wherein deleting, according to the access results, the links in the sitemap that affect search indexing comprises:
when the access result is an HTTP 404 error indicating that the page cannot be accessed, deleting the corresponding link; or,
when the access result is that the page response time is greater than or equal to a set threshold, deleting the corresponding link; or,
when the access result is that the title, keywords, and description of the page are incomplete, deleting the corresponding link; or,
when the access result is that the body content of the page does not match the title, keywords, and description of the page, deleting the corresponding link.
2. The method according to claim 1, characterized in that, after obtaining the links of the pages in the sitemap and accessing them, the method further comprises:
extracting keywords and a text characteristic value from each accessed page;
deleting the links in the sitemap that affect search indexing according to the result of comparing the extracted keywords and text characteristic value with pre-stored keywords and text characteristic values.
3. The method according to claim 2, characterized in that deleting the links in the sitemap that affect search indexing according to the result of comparing the extracted keywords and text characteristic value with the pre-stored keywords and text characteristic values comprises:
when the comparison shows that the extracted keywords and text characteristic value are consistent with the pre-stored keywords and text characteristic value, determining that the content is a repeated submission, and deleting the corresponding link.
4. The method according to any one of claims 1 to 3, characterized in that the method further comprises:
providing the generated new sitemap to a search engine for access; or
performing configuration on the website to instruct the search engine to obtain the generated new sitemap directly from the service platform.
5. The method according to claim 4, characterized in that the method further comprises:
recording the indexing data of the search indexing performed by the search engine after it accesses the new sitemap.
6. A service platform apparatus for processing a sitemap, wherein the service platform apparatus negotiates with a website in advance, and the website sets a mapping relationship between the sitemap and the service platform so as to allow the service platform apparatus to process the sitemap of the website according to configuration information provided by the website, the apparatus comprising:
an acquisition module, configured to obtain the sitemap of the website according to preset information;
an access module, configured to obtain, from the sitemap obtained by the acquisition module, the links of the pages in the sitemap and access them;
a first processing module, configured to delete, according to the access results of the access module, the links in the sitemap that affect search indexing;
a generation module, configured to generate a new sitemap after the processing by the first processing module and to replace the original sitemap on the website with the new sitemap,
wherein the first processing module comprises:
a first deletion unit, configured to delete the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed; or,
a second deletion unit, configured to delete the corresponding link when the access result is that the page response time is greater than or equal to a set threshold; or,
a third deletion unit, configured to delete the corresponding link when the access result is that the title, keywords, and description of the page are incomplete; or,
a fourth deletion unit, configured to delete the corresponding link when the access result is that the body content of the page does not match the title, keywords, and description of the page.
7. The apparatus according to claim 6, characterized in that the apparatus further comprises:
a second processing module, configured to extract keywords and a text characteristic value from each accessed page, and to delete the links in the sitemap that affect search indexing according to the result of comparing the extracted keywords and text characteristic value with pre-stored keywords and text characteristic values;
wherein the generation module generates the new sitemap after the processing by the first processing module and the second processing module.
8. The apparatus according to claim 6, characterized in that the apparatus further comprises:
an output module, configured to provide the new sitemap generated by the generation module to a search engine for access; or configured to perform configuration on the website to instruct the search engine to obtain the generated new sitemap directly from the service platform.
9. The apparatus according to claim 8, characterized in that the apparatus further comprises:
a monitoring module, configured to record the indexing data of the search indexing performed by the search engine after it accesses the new sitemap.
10. A processing device, characterized by comprising:
a memory, configured to store a program; and
a processor, configured to execute the program, stored in the memory, of the method according to claim 1.
CN201510676894.0A 2015-10-16 2015-10-16 A kind of method, apparatus and equipment for handling site maps Active CN105260469B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510676894.0A CN105260469B (en) 2015-10-16 2015-10-16 A kind of method, apparatus and equipment for handling site maps
PCT/CN2016/102215 WO2017063596A1 (en) 2015-10-16 2016-10-14 Method, apparatus and device for processing sitemap

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510676894.0A CN105260469B (en) 2015-10-16 2015-10-16 A kind of method, apparatus and equipment for handling site maps

Publications (2)

Publication Number Publication Date
CN105260469A CN105260469A (en) 2016-01-20
CN105260469B true CN105260469B (en) 2017-12-26

Family

ID=55100159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510676894.0A Active CN105260469B (en) 2015-10-16 2015-10-16 A kind of method, apparatus and equipment for handling site maps

Country Status (2)

Country Link
CN (1) CN105260469B (en)
WO (1) WO2017063596A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105260469B (en) * 2015-10-16 2017-12-26 广州神马移动信息科技有限公司 A kind of method, apparatus and equipment for handling site maps
CN106095674B (en) * 2016-06-07 2019-05-24 百度在线网络技术(北京)有限公司 A kind of website automation test method and device
CN107807937B (en) * 2016-09-09 2021-11-30 阿里巴巴集团控股有限公司 Website SEO processing method, device and system
CN108255831B (en) * 2016-12-28 2021-12-17 航天信息股份有限公司 Method and system for generating website map for website
CN111695056B (en) * 2019-03-12 2024-03-22 阿里巴巴集团控股有限公司 Page processing and page return processing methods, devices and equipment
CN112307395A (en) * 2020-08-10 2021-02-02 北京沃东天骏信息技术有限公司 Method and device for generating website map

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1486457A (en) * 2000-11-21 2004-03-31 汤姆森许可公司 System and process for mediated crawling
CN102057372A (en) * 2008-04-17 2011-05-11 谷歌公司 Generating sitemaps
CN104317938A (en) * 2014-10-31 2015-01-28 北京国双科技有限公司 Webpage validation method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7769742B1 (en) * 2005-05-31 2010-08-03 Google Inc. Web crawler scheduler that utilizes sitemaps from websites
US8126869B2 (en) * 2008-02-08 2012-02-28 Microsoft Corporation Automated client sitemap generation
US7865497B1 (en) * 2008-02-21 2011-01-04 Google Inc. Sitemap generation where last modified time is not available to a network crawler
CN105260469B (en) * 2015-10-16 2017-12-26 广州神马移动信息科技有限公司 A kind of method, apparatus and equipment for handling site maps

Also Published As

Publication number Publication date
WO2017063596A1 (en) 2017-04-20
CN105260469A (en) 2016-01-20

Similar Documents

Publication Publication Date Title
CN105260469B (en) A kind of method, apparatus and equipment for handling site maps
Bar-Yossef et al. Do not crawl in the DUST: Different URLs with similar text
US9614862B2 (en) System and method for webpage analysis
US9251157B2 (en) Enterprise node rank engine
CN104471582B (en) The defence tracked to search engine
KR101584123B1 (en) System and method of search validation
CN103544172B (en) A kind of chapters and sections catalogue processing method and processing device of e-book
CN104683328A (en) Method and system for scanning cross-site vulnerability
CN103077254B (en) Webpage acquisition methods and device
KR20100084510A (en) Identifying information related to a particular entity from electronic sources
US20150324350A1 (en) Identifying Content Relationship for Content Copied by a Content Identification Mechanism
US20100011025A1 (en) Transfer learning methods and apparatuses for establishing additive models for related-task ranking
CN103618696B (en) Method and server for processing cookie information
CN107341399A (en) Assess the method and device of code file security
US9792370B2 (en) Identifying equivalent links on a page
CN107526718A (en) Method and apparatus for generating text
CN105718533A (en) Information pushing method and device
KR20030016037A (en) Method for searching web page on popularity of visiting web pages and apparatus thereof
CN113032655A (en) Method for extracting and fixing dark network electronic data
CN106603490A (en) Phishing website detecting method and system
CN103617225B (en) A kind of associating web pages searching method and system
CN107784107A (en) Dark chain detection method and device based on flight behavior analysis
WO2010061990A1 (en) Web page searching system and method using access time and frequency
CN104778232B (en) Searching result optimizing method and device based on long query
CN111814040B (en) Maintenance case searching method, device, terminal equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200812

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 510627 Unit 01, 12th Floor, Tower B, Guangdian Pingyun Plaza, No. 163 Pingyun Road, Huangpu Avenue West, Tianhe District, Guangzhou, Guangdong Province

Patentee before: GUANGZHOU SHENMA MOBILE INFORMATION TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right