CN105260469B - Method, apparatus and device for processing a sitemap - Google Patents
- Publication number
- CN105260469B, CN201510676894.0A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
Abstract
The present invention discloses a method, apparatus and device for processing a sitemap. The method includes: obtaining the sitemap of a website according to preset information; obtaining the links of the pages in the sitemap and accessing them; deleting from the sitemap, according to the access results, the links that hinder search-engine indexing; and generating a new sitemap. The technical solution provided by the invention can improve the quality of the sitemap and increase the likelihood that its links are indexed by search engines, meeting the needs of both the website and the search engine.
Description
Technical field
The present invention relates to the field of mobile Internet technology, and in particular to a method, apparatus and device for processing a sitemap.
Background
At present, search engines usually discover web pages by following links within a website and on other websites. A sitemap makes it easy for a website to tell search engines which of its pages are available for crawling. In its simplest form, a sitemap is an XML (Extensible Markup Language) file that lists the URLs of the website together with metadata about each URL (the time of its last update, how often it changes, its importance relative to the other URLs of the site, and so on), so that search engines can crawl the site's content more intelligently. Put simply, a sitemap can be understood as a list of the links on a website. Generating a sitemap and submitting it to a search engine makes the site's content easier to index, including pages buried deep in the site, and is a good way for a website and a search engine to communicate.
However, the links in the sitemaps that many websites currently provide can have a variety of quality problems: links are broken, or the linked content is of poor quality or is not updated in time. All of these waste the search engine's crawling resources. As a result, even though the website provides a sitemap, the search engine will not necessarily index the links in it, depending on what it finds when crawling. Such links may also trigger the search engine's demotion rules, reducing the number of links indexed for the website and lowering the website's search ranking.
Therefore, existing methods of processing sitemaps cannot meet the needs of both websites and search engines.
Summary of the invention
To solve the above technical problem, the present invention provides a method, apparatus and device for processing a sitemap that can meet the needs of both the website and the search engine.
According to one aspect of the present invention, a method for processing a sitemap is provided, including:
obtaining the sitemap of a website according to preset information;
obtaining the links of the pages in the sitemap and accessing them;
deleting from the sitemap, according to the access results, the links that hinder search-engine indexing;
generating a new sitemap.
Preferably, after obtaining the links of the pages in the sitemap and accessing them, the method further includes:
extracting keywords and a body-text feature value from each accessed page;
deleting from the sitemap the links that hinder search-engine indexing, according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values.
Preferably, deleting from the sitemap, according to the access results, the links that hinder search-engine indexing includes:
deleting the corresponding link when the access result is an HTTP 404 error, meaning the page cannot be accessed; or
deleting the corresponding link when the access result is a page response time greater than or equal to a set threshold; or
deleting the corresponding link when the access result is that the title, keywords and description of the page are incomplete; or
deleting the corresponding link when the access result is that the body content of the page does not match the title, keywords and description of the page.
Preferably, deleting from the sitemap the links that hinder search-engine indexing, according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values, includes:
when the extracted keywords and body-text feature value are identical to pre-stored keywords and body-text feature values, determining that the content is a duplicate submission and deleting the corresponding link.
Preferably, the method further includes:
providing the generated new sitemap to a search engine for access.
Preferably, the method further includes:
recording the indexing data produced when the search engine crawls and indexes pages after accessing the new sitemap.
According to another aspect of the present invention, an apparatus for processing a sitemap is provided, including:
an acquisition module, configured to obtain the sitemap of a website according to preset information;
an access module, configured to obtain the links of the pages in the sitemap obtained by the acquisition module and to access them;
a first processing module, configured to delete from the sitemap, according to the access results of the access module, the links that hinder search-engine indexing;
a generation module, configured to generate a new sitemap after the first processing module has finished processing.
Preferably, the apparatus further includes:
a second processing module, configured to extract keywords and a body-text feature value from each accessed page and, according to the result of comparing them with pre-stored keywords and body-text feature values, to delete from the sitemap the links that hinder search-engine indexing;
the generation module generates the new sitemap after both the first processing module and the second processing module have finished processing.
Preferably, the apparatus further includes:
an output module, configured to provide the new sitemap generated by the generation module to a search engine for access.
Preferably, the apparatus further includes:
a monitoring module, configured to record the indexing data produced when the search engine crawls and indexes pages after accessing the new sitemap.
Preferably, the first processing module includes:
a first deletion unit, configured to delete the corresponding link when the access result is an HTTP 404 error, meaning the page cannot be accessed; or
a second deletion unit, configured to delete the corresponding link when the access result is a page response time greater than or equal to a set threshold; or
a third deletion unit, configured to delete the corresponding link when the access result is that the title, keywords and description of the page are incomplete; or
a fourth deletion unit, configured to delete the corresponding link when the access result is that the body content of the page does not match the title, keywords and description of the page.
According to another aspect of the present invention, a processing device is provided, including:
a memory, configured to store a program; and
a processor, configured to execute the following program stored in the memory:
obtaining the sitemap of a website according to preset information;
obtaining the links of the pages in the sitemap and accessing them;
deleting from the sitemap, according to the access results, the links that hinder search-engine indexing;
generating a new sitemap.
It can be seen that in the technical solution of the embodiments of the present invention, the links of the pages in the sitemap are obtained and accessed first; once the access results reveal links that hinder search-engine indexing, those links are deleted from the sitemap and a new sitemap is generated. The website's original sitemap is thereby optimized, and links whose content is poor or that are prone to errors are kept out of the sitemap as far as possible. This improves the quality of the sitemap, increases the likelihood of being indexed by search engines, and meets the needs of both the website and the search engine.
Brief description of the drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the more detailed description of exemplary embodiments of the disclosure given with reference to the accompanying drawings, in which the same reference numerals generally denote the same components.
Fig. 1 is a schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention;
Fig. 2 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention;
Fig. 3 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention;
Fig. 4 is a schematic block diagram of an apparatus for processing a sitemap according to the present invention;
Fig. 5 is another schematic block diagram of an apparatus for processing a sitemap according to the present invention;
Fig. 6 is a schematic block diagram of a processing device according to the present invention.
Detailed description
Preferred embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited to the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey its scope to those skilled in the art.
The present invention provides a method for processing a sitemap that can meet the needs of both the website and the search engine.
Fig. 1 is a schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention.
As shown in Fig. 1, the method includes:
Step 101: obtain the sitemap of the website according to preset information.
In this step, after an agreement has been reached with the website, the sitemap of the website can be obtained according to the setting information that the website provides.
Step 102: obtain the links of the pages in the sitemap and access them.
In this step, each URL (Uniform Resource Locator) link in the sitemap is obtained, and each URL link is accessed in turn to verify it.
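As a minimal sketch of the link-extraction part of step 102, the following Python assumes the sitemap follows the public sitemaps.org 0.9 XML schema; the text itself only says that the sitemap is an XML file. The function name `extract_urls` and the sample document are illustrative and do not come from the disclosure.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemap uses the sitemaps.org 0.9 namespace.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml):
    """Return the text of every <loc> element found inside a <url> entry."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Illustrative sitemap with two entries.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2015-10-01</lastmod></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(extract_urls(SAMPLE))
```

Each returned URL would then be accessed and checked as described in the following steps.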
Step 103: delete from the sitemap, according to the access results, the links that hinder search-engine indexing.
In this step, deleting from the sitemap, according to the access results, the links that hinder search-engine indexing includes:
deleting the corresponding link when the access result is an HTTP 404 error, meaning the page cannot be accessed; or
deleting the corresponding link when the access result is a page response time greater than or equal to a set threshold; or
deleting the corresponding link when the access result is that the title, keywords and description of the page are incomplete; or
deleting the corresponding link when the access result is that the body content of the page does not match the title, keywords and description of the page.
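The four deletion conditions of step 103 can be sketched as a single decision function. This is only an illustration: the `AccessResult` record, its field names, the reason strings and the 1-second threshold are assumptions introduced here, and how each field is measured is left to the later steps.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed threshold; the description leaves the actual value open.
RESPONSE_THRESHOLD_SECONDS = 1.0


@dataclass
class AccessResult:
    """Hypothetical summary of one page access."""
    status: int              # HTTP status code returned for the URL
    response_seconds: float  # measured page response time
    meta_complete: bool      # title, keywords and description all present
    body_matches_meta: bool  # body agrees with title, keywords, description


def deletion_reason(result: AccessResult) -> Optional[str]:
    """Return why the link should be deleted, or None to keep it."""
    if result.status == 404:
        return "http_404"
    if result.response_seconds >= RESPONSE_THRESHOLD_SECONDS:
        return "slow_response"
    if not result.meta_complete:
        return "meta_incomplete"
    if not result.body_matches_meta:
        return "body_meta_mismatch"
    return None


print(deletion_reason(AccessResult(404, 0.1, True, True)))
```

Returning a reason string rather than a bare boolean makes it easy to record why a link was removed.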
Step 104: generate a new sitemap.
In this step, after every link that hinders search-engine indexing has been deleted from the sitemap, the remaining links are rearranged and a new sitemap is generated.
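Step 104 could look roughly like the following sketch, which again assumes the sitemaps.org 0.9 schema and writes only `<loc>` entries. The function name `build_sitemap` is illustrative; a fuller version would also carry over metadata such as the last-update time.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"  # assumed schema


def build_sitemap(urls, removed):
    """Serialize the links that survive filtering, in their original order."""
    removed = set(removed)
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for url in urls:
        if url in removed:
            continue  # skip links that were flagged for deletion
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


new_sitemap = build_sitemap(
    ["https://example.com/", "https://example.com/gone"],
    removed=["https://example.com/gone"],
)
print(new_sitemap)
```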
It can be seen that in the technical solution of the embodiments of the present invention, the links of the pages in the sitemap are obtained and accessed first; once the access results reveal links that hinder search-engine indexing, those links are deleted from the sitemap and a new sitemap is generated. The website's original sitemap is thereby optimized, and links whose content is poor or that are prone to errors are kept out of the sitemap as far as possible. This improves the quality of the sitemap, increases the likelihood of being indexed by search engines, and meets the needs of both the website and the search engine.
The technical solution is described in more detail below.
Fig. 2 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention.
As shown in Fig. 2, the method includes:
Step 201: obtain the sitemap of the website according to preset information.
For this step, see the description of step 101 above.
Step 202: obtain the links of the pages in the sitemap and access them.
For this step, see the description of step 102 above.
Step 203: delete from the sitemap, according to the access results, the links that hinder search-engine indexing.
For this step, see the description of step 103 above.
Step 204: extract keywords and a body-text feature value from each accessed page.
In this step, any of various existing algorithms may be used to extract keywords from the page content and a body-text feature value from the body content; the present invention places no limitation on this.
Step 205: delete from the sitemap the links that hinder search-engine indexing, according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values.
In this step, when the extracted keywords and body-text feature value are identical to pre-stored keywords and body-text feature values, the content is determined to be a duplicate submission and the corresponding link is deleted.
Step 206: generate a new sitemap.
Step 207: provide the generated new sitemap to the search engine for access.
In this step, the generated new sitemap may replace the website's original sitemap, so that the search engine accesses the new sitemap at the website; alternatively, the website may be configured so that the search engine accesses the new sitemap directly on the service platform. The present invention places no limitation on this, as long as the search engine can access the new sitemap.
It should be noted that there is no mandatory order between the processing of steps 202 and 203 and the processing of steps 204 and 205; the order above is used only for convenience of description.
It should also be noted that after step 207 the method may further include: recording the indexing data produced when the search engine crawls and indexes pages after accessing the new sitemap.
It can be seen that the technical solution of the embodiments of the present invention can delete from the sitemap the links that hinder search-engine indexing both according to the access results and according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values, which improves the optimization effect. In addition, the indexing data produced when the search engine crawls and indexes pages after accessing the new sitemap can be recorded, to serve as a reference for later modifications of the sitemap or for analysis by the website.
Fig. 3 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention.
As shown in Fig. 3, the method includes:
Step 301: the sitemap service platform extracts data from the website's sitemap according to the setting information of the website.
In this step, the website reaches an agreement in advance with the sitemap service platform (hereinafter, the service platform). The website sets up a mapping between its sitemap and the service platform, allowing the service platform to process the sitemap according to setting information provided by the website, such as address information. The website may set up the mapping by means of XML. According to the setting information provided by the website, the service platform can extract data from the sitemap and obtain the URL of each link in it.
Step 302: the service platform checks each URL extracted from the sitemap and determines whether accessing the URL produces an HTTP 404 error, meaning the page cannot be accessed. If so, go to step 311, where the URL is deleted from the sitemap and the reason is recorded; if not, go to step 303.
An HTTP 404 error means that the web page the link points to does not exist, that is, the URL of the original page is no longer valid. This happens often, for example when the rules for generating page URLs change, when a page file is renamed or moved, or when an imported link is misspelled, so that the original URL can no longer be accessed. When the web server receives such a request, it returns a 404 status code, telling the browser that the requested resource does not exist. Therefore, an HTTP 404 error for a URL indicates that the URL is no longer valid, and the URL is deleted from the sitemap and the reason is recorded.
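One way to perform the check of step 302 with Python's standard library is sketched below. The function names `http_status` and `is_dead_link` and the injectable `opener` parameter are illustrative choices made here so that the check can be exercised without network access; the disclosure does not prescribe any particular HTTP client.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the HTTP status code for url; opener is injectable for testing."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx answers; the code is on the exception.
        return err.code


def is_dead_link(url, **kwargs):
    """True when the server answers 404 for the URL."""
    return http_status(url, **kwargs) == 404
```

Connection failures such as DNS errors or timeouts raise other exceptions and are deliberately not treated as 404 in this sketch; a caller would need to decide how to handle them.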
Step 303: the service platform determines whether the page response speed when accessing the URL is abnormal. If so, go to step 311, where the URL is deleted from the sitemap and the reason is recorded; if not, go to step 304.
When the URL can be accessed normally, the response speed of the page is checked; it can be measured by the response time. If the response time is greater than or equal to a set threshold, the response speed is considered abnormal; if it is below the threshold, the response speed is considered normal. The threshold can be chosen empirically, for example 500 milliseconds or 1 second; the present invention places no limitation on this.
It should be noted that whether the response speed is abnormal can also be judged by comparing the page's historical response speed with its current response speed. If the current response time exceeds the historical response time by more than a certain threshold, the response speed is considered abnormal.
Therefore, an abnormal page response speed indicates that the page corresponding to the URL may have a problem, or that the network connection for the URL may have a problem. Either affects the user's browsing experience, so the URL is deleted from the sitemap and the reason is recorded.
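The two response-speed criteria of step 303 can be combined as in the sketch below. The 1-second default, the use of the mean of past response times as the historical baseline, and the `slowdown_factor` ratio are all assumptions made for illustration; the text only requires a set threshold and some comparison with history.

```python
def response_is_abnormal(current_s, threshold_s=1.0, history_s=(), slowdown_factor=3.0):
    """Decide whether a page response time (in seconds) counts as abnormal."""
    # Criterion 1: at or above the fixed threshold.
    if current_s >= threshold_s:
        return True
    # Criterion 2 (assumed form): much slower than the historical average.
    if history_s:
        baseline = sum(history_s) / len(history_s)
        if baseline > 0 and current_s > baseline * slowdown_factor:
            return True
    return False


print(response_is_abnormal(0.6, history_s=[0.1, 0.1]))
```

In practice `current_s` could be obtained by timing the page request with `time.perf_counter()`.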
Step 304: the service platform determines whether the TKD of the page is incomplete. If so, go to step 311, where the URL is deleted from the sitemap and the reason is recorded; if not, go to step 305.
TKD is an abbreviation of title, keywords and description. TKD content can take the following form:
<title>Title content goes here</title>
<meta name="keywords" content="Keyword content goes here"/>
<meta name="description" content="Description content goes here"/>
Keywords are the terms that a webmaster sets for a page of the website so that users can find the page through a search engine; they represent the market positioning of the website. The description, also called the "content tag", "description tag" or "content summary", reflects the main content of the page.
A complete TKD generally satisfies the search rules of search engines. If the TKD is incomplete it does not satisfy those rules, and the search engine may not search the page or may not index the linked content. Therefore, when the TKD is found to be incomplete, the URL is deleted from the sitemap and the reason is recorded.
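A minimal sketch of the completeness check in step 304, using the standard-library HTML parser, is given below. The class and function names are illustrative, and "complete" is taken here to mean that all three fields are present and non-empty, which is an interpretation rather than a definition from the text.

```python
from html.parser import HTMLParser


class TKDParser(HTMLParser):
    """Collect the <title> text and the keywords/description meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            content = (attr.get("content") or "").strip()
            if name == "keywords":
                self.keywords = content
            elif name == "description":
                self.description = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_tkd(html):
    parser = TKDParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "keywords": parser.keywords,
        "description": parser.description,
    }


def tkd_is_complete(tkd):
    """Assumed rule: every one of the three fields must be non-empty."""
    return all(tkd.values())
```

For example, a page with only a `<title>` element would be reported as incomplete and its URL would go to step 311.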
Step 305: the service platform determines whether the body content of the page fails to match the TKD. If so, go to step 311, where the URL is deleted from the sitemap and the reason is recorded; if not, go to step 306.
In this step, the body content of the page is examined to determine whether the keywords in the TKD appear in the body and whether the body content corresponds to the title and description in the TKD. If the keywords in the TKD appear in the body and the body content corresponds to the title and description, the body matches the TKD; otherwise it does not. A mismatch may mean that the body or the TKD has been set incorrectly; either impairs the search quality of the search engine and the user's browsing experience. Therefore, when the body content of the page is found not to match the TKD, the URL is deleted from the sitemap and the reason is recorded.
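The text does not say how "corresponds to the title and description" is measured, so the sketch below substitutes a simple word-overlap ratio for that part; the `min_overlap` parameter, the comma-separated keyword format and the requirement that every keyword appear in the body are all assumptions. The regular-expression tokenizer also assumes space-delimited text; Chinese pages would need a word segmenter instead.

```python
import re


def _words(text):
    """Lower-cased set of word tokens in text."""
    return set(re.findall(r"\w+", text.lower()))


def body_matches_tkd(body, title, keywords, description, min_overlap=0.5):
    """Assumed matching rule: every keyword occurs in the body, and at least
    min_overlap of the title words and of the description words occur too."""
    body_lower = body.lower()
    keyword_list = [k.strip().lower() for k in keywords.split(",") if k.strip()]
    if not all(k in body_lower for k in keyword_list):
        return False
    body_words = _words(body)
    for field in (title, description):
        field_words = _words(field)
        if not field_words:
            return False
        if len(field_words & body_words) / len(field_words) < min_overlap:
            return False
    return True


print(body_matches_tkd(
    "Our blue widgets are hand made. Browse the catalogue of widgets.",
    "Blue widgets", "widgets, blue", "A catalogue of blue widgets.",
))
```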
Step 306: the service platform extracts keywords from the page content and a body-text feature value from the body content.
In this step, the service platform may use any of various existing algorithms to extract keywords from the page content and a body-text feature value from the body content; the present invention places no limitation on this.
For example, keyword extraction may use the existing TF-IDF (term frequency-inverse document frequency) algorithm, which essentially stores the information for all words in a dictionary, sorts the dictionary by value, and takes the few highest-weighted words as the keywords. For extracting the body-text feature value, a text feature extraction method based on a context framework or one based on ontology, among others, may be used.
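The TF-IDF procedure outlined above (store a score per word in a dictionary, sort by value, take the top few) can be sketched as follows. The smoothed inverse-document-frequency formula, the tokenizer and the small corpus are illustrative assumptions; the text does not fix any of them.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Very simple tokenizer for space-delimited text (an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_keywords(doc, corpus, k=3):
    """Score each word of doc by TF-IDF against corpus and return the top k."""
    tokens = tokenize(doc)
    tf = Counter(tokens)
    n_docs = len(corpus)
    doc_sets = [set(tokenize(d)) for d in corpus]
    scores = {}  # the "dictionary" holding one weight per word
    for term, count in tf.items():
        df = sum(1 for s in doc_sets if term in s)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed variant
        scores[term] = (count / len(tokens)) * idf
    # Sort by weight, highest first; ties broken alphabetically.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the sitemap lists the sitemap urls",
]
print(top_keywords(CORPUS[2], CORPUS, k=1))
```

A production system would normally also remove stop words, which this sketch omits.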
Step 307: the service platform compares the extracted keywords and body-text feature value with the keywords and body-text feature values stored on the service platform and checks whether the content is a duplicate submission. If so, go to step 311, where the URL is deleted from the sitemap and the reason is recorded; if not, go to step 308.
In this step, comparing the extracted keywords and body-text feature value with those stored on the service platform performs a text-relevance match: if identical keywords and body-text feature value are found on the service platform, the content is determined to be a duplicate. This match detects whether content has been submitted more than once. The service platform stores in advance the keywords and body-text feature value of every page article it has checked.
Step 308: the service platform saves the extracted keywords, the body-text feature value and the corresponding link for use in later duplicate checks.
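Steps 307 and 308 amount to a lookup followed by a store. In the sketch below the body-text feature value is stood in for by a SHA-256 hash of the whitespace-normalized, lower-cased body; this is purely an assumption, since the text leaves the feature-extraction method open, and an exact hash only catches identical bodies. The class name `DuplicateIndex` and its in-memory dictionary are likewise illustrative.

```python
import hashlib


class DuplicateIndex:
    """In-memory stand-in for the keywords and feature values kept by the platform."""

    def __init__(self):
        self._seen = {}  # (keyword set, feature value) -> first URL stored

    @staticmethod
    def feature_value(body_text):
        # Assumed feature value: hash of the normalized body text.
        normalized = " ".join(body_text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def check_and_store(self, url, keywords, body_text):
        """Return the URL of an earlier identical page, or None after storing."""
        key = (frozenset(k.lower() for k in keywords), self.feature_value(body_text))
        if key in self._seen:
            return self._seen[key]  # step 307: duplicate submission found
        self._seen[key] = url       # step 308: keep for later duplicate checks
        return None


index = DuplicateIndex()
print(index.check_and_store("https://example.com/a", ["sitemap", "seo"], "Hello  World"))
print(index.check_and_store("https://example.com/b", ["SEO", "sitemap"], "hello world"))
```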
Step 309: the service platform generates the new, processed sitemap data for the search engine to obtain.
In this step, the website may be configured to instruct the search engine to obtain the sitemap directly from the service platform; alternatively, the service platform may replace the website's original sitemap with the new sitemap.
Step 310: the service platform monitors the indexing status of the latest sitemap data.
When a URL in the sitemap is indexed by a search engine, marker information is returned. The service platform monitors which URLs have been indexed by the search engine, which can serve as a reference for later adjustments to the sitemap.
Step 311: the service platform deletes the link from the sitemap and records the reason for analysis by the website.
In this step, the reason for deleting the link can be recorded in detail for analysis by the website.
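The control flow of steps 302 to 311, in which each URL passes through the checks in order and the first failing check sends it to step 311 with a recorded reason, can be sketched as below. The `checks` list of (reason, predicate) pairs and the log format are hypothetical; the lambdas in the example merely stand in for the real checks.

```python
def filter_links(urls, checks):
    """Run ordered checks on each URL.

    checks is a list of (reason, should_delete) pairs, where should_delete(url)
    returns True when the link must be removed. Returns the kept URLs and a
    removal log with one entry per deleted link.
    """
    kept, removal_log = [], []
    for url in urls:
        for reason, should_delete in checks:
            if should_delete(url):
                # Step 311: delete the link and record why.
                removal_log.append({"url": url, "reason": reason})
                break
        else:
            kept.append(url)  # no check fired, so the link stays
    return kept, removal_log


# Stand-in predicates for illustration only.
example_checks = [
    ("http_404", lambda u: u.endswith("/gone")),
    ("slow_response", lambda u: "slow" in u),
]
kept, log = filter_links(
    ["https://example.com/", "https://example.com/gone", "https://example.com/slow"],
    example_checks,
)
print(kept)
print(log)
```

Keeping the checks in an ordered list mirrors the flowchart, where a link that fails an early check never reaches the later ones.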
It can be seen that the technical solution of the embodiments of the present invention analyses and filters the sitemap data obtained from the website and verifies the links provided in the sitemap by accessing them; in addition, it extracts keywords and a body-text feature value from the body content and matches them against pre-stored keywords and body-text feature values, thereby avoiding the submission of duplicate or low-quality content. Finally, it can monitor how the search engine indexes the sitemap. Through this processing, the present invention can optimize the quality of the sitemap, increase the amount of website content indexed by search engines, and let search engines index the website's pages better. It also solves the problem of search demotion caused by submitting duplicate or junk content to search engines, and allows the state of the website's content to be monitored better.
The method for processing a sitemap according to the present invention has been described in detail above. Correspondingly, the present invention also provides an apparatus for processing a sitemap.
Fig. 4 is a schematic block diagram of an apparatus for processing a sitemap according to the present invention.
As shown in Fig. 4, an apparatus for processing a sitemap includes: an acquisition module 401, an access module 402, a first processing module 403 and a generation module 404. The apparatus for processing a sitemap according to the present invention may be a service platform or another device.
The acquisition module 401 is configured to obtain the sitemap of a website according to preset information.
After an agreement has been reached with the website, the apparatus obtains the sitemap of the website through the acquisition module 401 according to the setting information provided by the website.
The access module 402 is configured to obtain the links of the pages in the sitemap obtained by the acquisition module 401 and to access them.
The access module 402 obtains each URL link in the sitemap and accesses each URL link in turn to verify it.
The first processing module 403 is configured to delete from the sitemap, according to the access results of the access module 402, the links that hinder search-engine indexing.
The first processing module 403 deletes the links that hinder search-engine indexing from the sitemap according to the various access results.
The generation module 404 is configured to generate a new sitemap after the first processing module 403 has finished processing.
Fig. 5 is another schematic block diagram of an apparatus for processing a sitemap according to the present invention.
As shown in Fig. 5, an apparatus for processing a sitemap includes an acquisition module 401, an access module 402, a first processing module 403 and a generation module 404; for the function of each module, see the description of Fig. 4.
In addition, the apparatus further includes a second processing module 405.
The second processing module 405 is configured to extract keywords and a body-text feature value from each accessed page and, according to the result of comparing them with pre-stored keywords and body-text feature values, to delete from the sitemap the links that hinder search-engine indexing. The generation module 404 generates the new sitemap after both the first processing module 403 and the second processing module 405 have finished processing.
When the extracted keywords and body-text feature value are identical to pre-stored keywords and body-text feature values, the second processing module 405 determines that the content is a duplicate submission and deletes the corresponding link.
The apparatus further includes an output module 406.
The output module 406 is configured to provide the new sitemap generated by the generation module to the search engine for access.
The generated new sitemap may replace the website's original sitemap, so that the search engine accesses the new sitemap at the website; alternatively, the website may be configured so that the search engine accesses the new sitemap directly on the service platform. The present invention places no limitation on this, as long as the search engine can access the new sitemap.
The apparatus further includes a monitoring module 407.
The monitoring module 407 is configured to record the indexing data produced when the search engine crawls and indexes pages after accessing the new sitemap.
The first processing module 403 includes a first deletion unit 4031, a second deletion unit 4032, a third deletion unit 4033 or a fourth deletion unit 4034.
The first deletion unit 4031 is configured to delete the corresponding link when the access result is an HTTP 404 error, meaning the page cannot be accessed.
The second deletion unit 4032 is configured to delete the corresponding link when the access result is a page response time greater than or equal to a set threshold.
The third deletion unit 4033 is configured to delete the corresponding link when the access result is that the title, keywords and description of the page are incomplete.
The fourth deletion unit 4034 is configured to delete the corresponding link when the access result is that the body content of the page does not match the title, keywords and description of the page.
The present invention also provides a processing device.
Fig. 6 is a schematic block diagram of a processing device according to the present invention.
As shown in Fig. 6, the processing device includes a memory 601 and a processor 602.
The memory 601 is configured to store a program.
The processor 602 is configured to execute the following program stored in the memory 601:
obtaining the sitemap of a website according to preset information;
obtaining the links of the pages in the sitemap and accessing them;
deleting from the sitemap, according to the access results, the links that hinder search-engine indexing;
generating a new sitemap.
It should be noted that for the other programs stored in the memory 601, see the description of the method flows above, which is not repeated here; the processor 602 is also configured to execute those other programs stored in the memory 601.
In summary, the technical solution of the embodiments of the present invention analyses and filters the sitemap data obtained from the website and verifies the links provided in the sitemap by accessing them; in addition, it extracts keywords and a body-text feature value from the body content and matches them against pre-stored keywords and body-text feature values, thereby avoiding the submission of duplicate or low-quality content. Finally, it can monitor how the search engine indexes the sitemap. Through this processing, the present invention can optimize the quality of the sitemap, increase the amount of website content indexed by search engines, and let search engines index the website's pages better. It also solves the problem of search demotion caused by submitting duplicate or junk content to search engines, and allows the state of the website's content to be monitored better.
The technical solution according to the present invention has been described in detail above with reference to the accompanying drawings.
In addition, the method according to the present invention may also be implemented as a computer program that includes computer program code instructions for performing the steps defined in the above method of the present invention. Alternatively, the method according to the present invention may be implemented as a computer program product that includes a computer-readable medium storing a computer program for performing the functions defined in the above method of the present invention. Those skilled in the art will also appreciate that the various illustrative logical blocks, modules, circuits and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.
The flowcharts and block diagrams in the accompanying drawings show the architecture, functions and operations of possible implementations of systems and methods according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
Various embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application, or their improvement over technologies found in the marketplace, or to enable other persons of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (10)
1. A method for a service platform to process a sitemap, wherein the service platform negotiates with a website in advance and the website sets a mapping relationship between the sitemap and the service platform, so as to allow the service platform to process the sitemap of the website according to configuration information provided by the website, the method comprising:
obtaining the sitemap of the website according to preset information;
obtaining the links of the pages in the sitemap and accessing them;
deleting, according to the access results, links in the sitemap that adversely affect search indexing; and
generating a new sitemap and replacing the original sitemap on the website with the new sitemap,
wherein deleting, according to the access results, links in the sitemap that adversely affect search indexing comprises:
deleting the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed; or
deleting the corresponding link when the access result is that the page response time is greater than or equal to a set threshold; or
deleting the corresponding link when the access result is that the title, keywords and description of the page are incomplete; or
deleting the corresponding link when the access result is that the body content of the page does not match the title, keywords and description of the page.
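The four deletion conditions of claim 1 can be sketched as a single predicate. This is an illustrative sketch under stated assumptions, not the claimed implementation: the `AccessResult` record, the default threshold of 3.0 seconds, and the mismatch test (none of the declared keywords occurs in the body text) are all hypothetical choices, since the claim does not fix how a mismatch is determined or what the threshold is.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AccessResult:
    """Hypothetical record of what was observed when a sitemap link was accessed."""
    status_code: int
    response_seconds: float
    title: str = ""
    keywords: List[str] = field(default_factory=list)
    description: str = ""
    body_text: str = ""


def metadata_incomplete(result: AccessResult) -> bool:
    """True when the title, the keywords, or the description is missing."""
    has_keywords = any(k.strip() for k in result.keywords)
    return not (result.title.strip() and has_keywords and result.description.strip())


def body_mismatch(result: AccessResult) -> bool:
    """Assumed mismatch test: no declared keyword appears in the body text."""
    body = result.body_text.lower()
    terms = [k.strip().lower() for k in result.keywords if k.strip()]
    return not any(term in body for term in terms)


def should_delete(result: AccessResult, max_seconds: float = 3.0) -> bool:
    """Apply the four conditions in order; any one of them removes the link."""
    if result.status_code == 404:
        return True
    if result.response_seconds >= max_seconds:  # threshold value is an assumption
        return True
    if metadata_incomplete(result):
        return True
    return body_mismatch(result)
```

The conditions are checked from cheapest to most expensive, so a link that already failed with a 404 or a slow response is never parsed for metadata or body content.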
2. The method according to claim 1, characterized in that, after obtaining the links of the pages in the sitemap and accessing them, the method further comprises:
extracting keywords and body-text feature values from the accessed pages; and
deleting links in the sitemap that adversely affect search indexing according to the result of comparing the extracted keywords and body-text feature values with pre-stored keywords and body-text feature values.
3. The method according to claim 2, characterized in that deleting links in the sitemap that adversely affect search indexing according to the result of comparing the extracted keywords and body-text feature values with the pre-stored keywords and body-text feature values comprises:
when the comparison shows that the extracted keywords and body-text feature values are consistent with the pre-stored keywords and body-text feature values, determining that the content is a repeated submission and deleting the corresponding link.
4. The method according to any one of claims 1 to 3, characterized in that the method further comprises:
providing the generated new sitemap to a search engine for access; or
performing configuration at the website so as to instruct the search engine to obtain the generated new sitemap directly from the service platform.
5. The method according to claim 4, characterized in that the method further comprises:
recording the indexing data resulting from the search engine's searching and indexing after it accesses the new sitemap.
6. A service platform device for processing a sitemap, wherein the service platform device negotiates with a website in advance and the website sets a mapping relationship between the sitemap and the service platform, so as to allow the service platform device to process the sitemap of the website according to configuration information provided by the website, the device comprising:
an acquisition module, configured to obtain the sitemap of the website according to preset information;
an access module, configured to obtain the links of the pages in the sitemap obtained by the acquisition module and to access them;
a first processing module, configured to delete, according to the access results of the access module, links in the sitemap that adversely affect search indexing; and
a generation module, configured to generate a new sitemap after processing by the first processing module and to replace the original sitemap on the website with the new sitemap,
wherein the first processing module comprises:
a first deletion unit, configured to delete the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed; or
a second deletion unit, configured to delete the corresponding link when the access result is that the page response time is greater than or equal to a set threshold; or
a third deletion unit, configured to delete the corresponding link when the access result is that the title, keywords and description of the page are incomplete; or
a fourth deletion unit, configured to delete the corresponding link when the access result is that the body content of the page does not match the title, keywords and description of the page.
7. The device according to claim 6, characterized in that the device further comprises:
a second processing module, configured to extract keywords and body-text feature values from the accessed pages and to delete links in the sitemap that adversely affect search indexing according to the result of comparing the extracted keywords and body-text feature values with pre-stored keywords and body-text feature values;
wherein the generation module generates the new sitemap after processing by the first processing module and the second processing module.
8. The device according to claim 6, characterized in that the device further comprises:
an output module, configured to provide the new sitemap generated by the generation module to a search engine for access, or to perform configuration at the website so as to instruct the search engine to obtain the generated new sitemap directly from the service platform.
9. The device according to claim 8, characterized in that the device further comprises:
a monitoring module, configured to record the indexing data resulting from the search engine's searching and indexing after it accesses the new sitemap.
10. A processing device, characterized by comprising:
a memory, configured to store a program; and
a processor, configured to execute the program stored in the memory so as to carry out the method according to claim 1.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510676894.0A CN105260469B (en) | 2015-10-16 | 2015-10-16 | A kind of method, apparatus and equipment for handling site maps |
PCT/CN2016/102215 WO2017063596A1 (en) | 2015-10-16 | 2016-10-14 | Method, apparatus and device for processing sitemap |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510676894.0A CN105260469B (en) | 2015-10-16 | 2015-10-16 | A kind of method, apparatus and equipment for handling site maps |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105260469A CN105260469A (en) | 2016-01-20 |
CN105260469B true CN105260469B (en) | 2017-12-26 |
Family
ID=55100159
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510676894.0A Active CN105260469B (en) | 2015-10-16 | 2015-10-16 | A kind of method, apparatus and equipment for handling site maps |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN105260469B (en) |
WO (1) | WO2017063596A1 (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105260469B (en) * | 2015-10-16 | 2017-12-26 | 广州神马移动信息科技有限公司 | A kind of method, apparatus and equipment for handling site maps |
CN106095674B (en) * | 2016-06-07 | 2019-05-24 | 百度在线网络技术(北京)有限公司 | A kind of website automation test method and device |
CN107807937B (en) * | 2016-09-09 | 2021-11-30 | 阿里巴巴集团控股有限公司 | Website SEO processing method, device and system |
CN108255831B (en) * | 2016-12-28 | 2021-12-17 | 航天信息股份有限公司 | Method and system for generating website map for website |
CN111695056B (en) * | 2019-03-12 | 2024-03-22 | 阿里巴巴集团控股有限公司 | Page processing and page return processing methods, devices and equipment |
CN112307395A (en) * | 2020-08-10 | 2021-02-02 | 北京沃东天骏信息技术有限公司 | Method and device for generating website map |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1486457A (en) * | 2000-11-21 | 2004-03-31 | 汤姆森许可公司 | System and process for mediated crawling |
CN102057372A (en) * | 2008-04-17 | 2011-05-11 | 谷歌公司 | Generating sitemaps |
CN104317938A (en) * | 2014-10-31 | 2015-01-28 | 北京国双科技有限公司 | Webpage validation method and device |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7769742B1 (en) * | 2005-05-31 | 2010-08-03 | Google Inc. | Web crawler scheduler that utilizes sitemaps from websites |
US8126869B2 (en) * | 2008-02-08 | 2012-02-28 | Microsoft Corporation | Automated client sitemap generation |
US7865497B1 (en) * | 2008-02-21 | 2011-01-04 | Google Inc. | Sitemap generation where last modified time is not available to a network crawler |
CN105260469B (en) * | 2015-10-16 | 2017-12-26 | 广州神马移动信息科技有限公司 | A kind of method, apparatus and equipment for handling site maps |
- 2015-10-16: CN CN201510676894.0A, patent CN105260469B, status: Active
- 2016-10-14: WO PCT/CN2016/102215, patent WO2017063596A1, status: Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2017063596A1 (en) | 2017-04-20 |
CN105260469A (en) | 2016-01-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105260469B (en) | A kind of method, apparatus and equipment for handling site maps | |
Bar-Yossef et al. | Do not crawl in the DUST: Different URLs with similar text | |
US9614862B2 (en) | System and method for webpage analysis | |
US9251157B2 (en) | Enterprise node rank engine | |
CN104471582B (en) | The defence tracked to search engine | |
KR101584123B1 (en) | System and method of search validation | |
CN103544172B (en) | A kind of chapters and sections catalogue processing method and processing device of e-book | |
CN104683328A (en) | Method and system for scanning cross-site vulnerability | |
CN103077254B (en) | Webpage acquisition methods and device | |
KR20100084510A (en) | Identifying information related to a particular entity from electronic sources | |
US20150324350A1 (en) | Identifying Content Relationship for Content Copied by a Content Identification Mechanism | |
US20100011025A1 (en) | Transfer learning methods and apparatuses for establishing additive models for related-task ranking | |
CN103618696B (en) | Method and server for processing cookie information | |
CN107341399A (en) | Assess the method and device of code file security | |
US9792370B2 (en) | Identifying equivalent links on a page | |
CN107526718A (en) | Method and apparatus for generating text | |
CN105718533A (en) | Information pushing method and device | |
KR20030016037A (en) | Method for searching web page on popularity of visiting web pages and apparatus thereof | |
CN113032655A (en) | Method for extracting and fixing dark network electronic data | |
CN106603490A (en) | Phishing website detecting method and system | |
CN103617225B (en) | A kind of associating web pages searching method and system | |
CN107784107A (en) | Dark chain detection method and device based on flight behavior analysis | |
WO2010061990A1 (en) | Web page searching system and method using access time and frequency | |
CN104778232B (en) | Searching result optimizing method and device based on long query | |
CN111814040B (en) | Maintenance case searching method, device, terminal equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20200812. Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province. Patentee after: Alibaba (China) Co.,Ltd. Address before: 510627 Guangdong city of Guangzhou province Whampoa Tianhe District Road No. 163 Xiping Yun Lu Yun Ping square B radio tower 12 layer self unit 01. Patentee before: GUANGZHOU SHENMA MOBILE INFORMATION TECHNOLOGY Co.,Ltd. |