CN104636340A - Webpage URL filtering method, device and system - Google Patents

Info

Publication number
CN104636340A
Authority
CN
China
Prior art keywords
url
field
webpage
data
unique
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310547585.4A
Other languages
Chinese (zh)
Inventor
蔡兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201310547585.4A
Publication of CN104636340A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Abstract

The invention relates to a webpage URL filtering method, device and system. The method comprises: obtaining a pre-collected URL data set of the same webpage; when the URL data set contains multiple URLs, performing field splitting and analysis on each URL in the URL data set; and, according to the field splitting and analysis results, removing the irrelevant fields to generate a unique URL for the webpage. By identifying the different URL formats of the same webpage and filtering out the fields that are unrelated to the page content, the method converts the various URLs of the same webpage into a unique URL, which effectively reduces the storage required for webpage data and improves the efficiency and accuracy of page-visit statistics. Compared with the traditional approach, no manual intervention is needed, website coverage is wide, and the results are accurate.

Description

Webpage URL filtering method, device and system
Technical field
The present invention relates to the field of Internet technology, and in particular to a webpage URL filtering method, device and system.
Background
The number of webpages on the Internet is growing explosively. To track the source and channel of page visits, some webmasters append extension fields to a page's original URL. For example, in "http://blog.sina.com.cn/s/blog_4ac981db0102f1ta.html?tj=1" the field "tj=1" is a webmaster-defined marker of the page's jump source, and in "http://km.oa.com/group/469/surveys/show/5772?jumpfrom=systemmail" the field "jumpfrom=systemmail" indicates that the user reached the page by clicking a link in a system e-mail.
In both examples, deleting the trailing segment of the URL leaves the page content unchanged. URLs increasingly carry such special-purpose parameters, so the same webpage ends up with multiple URLs, which complicates identifying the page, counting its visits, and storing its content. It is therefore necessary to identify and filter out redundant URL parameters and reduce the various URLs of the same webpage to their original form.
At present, the various URLs of the same webpage are identified and screened mainly by hand. In the two examples above, once a person has judged the trailing field to be invalid, a conversion rule is written that converts each URL carrying redundant parameters back to the original URL.
Although manual processing is flexible and fast, parameter formats can differ completely between websites and new formats appear frequently, so the cost of manual maintenance is prohibitive and the number of websites that can be covered is relatively small.
Summary of the invention
Embodiments of the present invention provide a webpage URL filtering method, device and system, intended to improve the efficiency of webpage data statistics and to simplify resource storage.
An embodiment of the present invention provides a webpage URL filtering method, comprising:
obtaining a pre-collected URL data set of the same webpage;
when the URL data set contains multiple URLs, performing field splitting and analysis on each URL in the URL data set; and
according to the field splitting and analysis results, removing the irrelevant fields and generating a unique URL for the webpage.
An embodiment of the present invention also provides a device for filtering webpage URLs, comprising:
a URL data acquisition module, configured to obtain a pre-collected URL data set of the same webpage;
a field splitting and analysis module, configured to perform field splitting and analysis on each URL in the URL data set when the URL data set contains multiple URLs; and
a generation module, configured to remove the irrelevant fields according to the field splitting and analysis results and generate a unique URL for the webpage.
An embodiment of the present invention also provides a system for filtering webpage URLs, comprising a browser and a data monitoring platform communicatively connected to the browser, wherein:
the browser comprises the device described above; and
the data monitoring platform is configured to receive the field-filtering error message that the browser reports when it judges the generated unique URL to be invalid.
The webpage URL filtering method, device and system provided by the embodiments of the present invention identify the different URL forms of the same webpage, filter out the fields that are unrelated to the page content, and convert the various URLs of the same webpage into a unique URL. This effectively reduces the storage required for webpage data and improves the efficiency and accuracy of page-visit statistics. Compared with the traditional approach, no manual intervention is needed, website coverage is wide, and the results are accurate.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a first embodiment of the webpage URL filtering method of the present invention;
Fig. 2 is a schematic flowchart of a second embodiment of the webpage URL filtering method of the present invention;
Fig. 3 is a schematic flowchart of a third embodiment of the webpage URL filtering method of the present invention;
Fig. 4 is a functional block diagram of a first embodiment of the webpage URL filtering device of the present invention;
Fig. 5 is a structural diagram of the field splitting and analysis module in an embodiment of the present invention;
Fig. 6 is a functional block diagram of a second embodiment of the webpage URL filtering device of the present invention;
Fig. 7 is a structural diagram of the verification module in an embodiment of the present invention;
Fig. 8 is a functional block diagram of a third embodiment of the webpage URL filtering device of the present invention;
Fig. 9 is a structural diagram of a preferred embodiment of the webpage URL filtering system of the present invention.
To make the technical solution of the present invention clearer, it is described in further detail below with reference to the accompanying drawings.
Detailed description
It should be understood that the specific embodiments described herein serve only to explain the present invention and are not intended to limit it.
As shown in Fig. 1, a first embodiment of the present invention provides a webpage URL filtering method, comprising:
Step S101: obtain a pre-collected URL data set of the same webpage.
This embodiment automatically identifies and filters redundant parameters in webpage URLs and reduces the various URLs of the same webpage to their original form, which improves the efficiency and accuracy of page-visit statistics and reduces the storage required for webpage data.
Specifically, the pre-collected URL data set of the same webpage is obtained first. In one implementation, the URL data that the browser back end has collected for the webpage is obtained from the browser back end by web document number (docid). Each record can be expressed as <docid, url>, where docid is the document number of the webpage, a 64-character string that uniquely identifies the webpage.
Thus every URL of the webpage can be retrieved from the browser back end by the webpage's docid. This docid is also saved as the original docid of the webpage for use in the subsequent verification.
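The <docid, url> records described above can be grouped into one URL data set per webpage. The following is a minimal Python sketch under stated assumptions: the function name `group_urls_by_docid`, the sample docid values and the in-memory record list are all hypothetical, since the patent does not specify how the browser back end exposes its data.

```python
from collections import defaultdict


def group_urls_by_docid(records):
    """Group (docid, url) pairs into {docid: [url, ...]}.

    Order of first appearance is kept and exact duplicate URLs are dropped.
    """
    grouped = defaultdict(list)
    for docid, url in records:
        if url not in grouped[docid]:
            grouped[docid].append(url)
    return dict(grouped)


# Hypothetical records; real docids would come from the browser back end.
records = [
    ("doc-1", "http://www.a.com/b?cid=1"),
    ("doc-1", "http://www.a.com/b?cid=1&sid=2"),
    ("doc-1", "http://www.a.com/b?cid=1"),
    ("doc-2", "http://www.a.com/c"),
]
url_sets = group_urls_by_docid(records)
```

Each key of `url_sets` doubles as the saved original docid that the later verification step compares against.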
Step S102: when the URL data set contains multiple URLs, perform field splitting and analysis on each URL in the URL data set.
As noted above, the same webpage may have multiple URLs. When the URL data set obtained from the browser back end contains multiple URLs, the URLs of that webpage need to be filtered to remove the irrelevant fields.
To filter the URLs, field splitting and analysis are first performed on each URL of the webpage to identify the irrelevant fields in each URL under the same docid.
In one implementation, field splitting and analysis proceed as follows.
First, the separators in each URL of the URL data set are located, and each URL is split at the separators to obtain all of its fields.
URL fields are usually separated by symbols such as "&" and "?", so these separators can be used to split each URL into its fields.
For example, splitting http://tech.3g.cn/ipnewscontent.php?nid=2149583&sid=008A263A71D yields the fields "nid=2149583" and "sid=008A263A71D".
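The splitting step can be sketched in Python with the standard library's `urllib.parse.urlsplit`, which separates the part before "?" from the query string; the query is then split on "&". The helper name `split_fields` is an assumption made for illustration, and the patent does not prescribe any particular parser.

```python
from urllib.parse import urlsplit


def split_fields(url):
    """Return (base, fields): the URL up to the path, and its 'key=value' fields."""
    parts = urlsplit(url)
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # Split the query on '&' and ignore empty pieces such as a trailing '&'.
    fields = [f for f in parts.query.split("&") if f]
    return base, fields


base, fields = split_fields(
    "http://tech.3g.cn/ipnewscontent.php?nid=2149583&sid=008A263A71D"
)
print(base)    # http://tech.3g.cn/ipnewscontent.php
print(fields)  # ['nid=2149583', 'sid=008A263A71D']
```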
After splitting, the irrelevant fields are identified as follows.
Each field in each URL is analyzed. If a field appears in every URL, it is considered a mandatory field; if a field is absent from at least one URL, it is considered an irrelevant field, and deleting it does not affect access to the original webpage. For example, suppose a webpage has the following 3 URLs:
http://www.a.com/b?cid=1;
http://www.a.com/b?cid=1&sid=2;
http://www.a.com/b?cid=1&sid=2&from=systemmail;
After field splitting and analysis, the field "cid=1" appears in every URL and is therefore judged mandatory, whereas "sid=2" and "from=systemmail" are each absent from at least one URL and are therefore judged irrelevant.
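The rule above amounts to a set intersection: the mandatory fields are those common to every URL, and everything else is irrelevant. The sketch below assumes the fields have already been split out of each URL; the function name `classify_fields` is hypothetical.

```python
def classify_fields(field_lists):
    """Split fields into (mandatory, irrelevant) sets.

    A field is mandatory if it appears in every URL's field list and
    irrelevant if it is missing from at least one of them.
    """
    field_sets = [set(fields) for fields in field_lists]
    if not field_sets:
        return set(), set()
    mandatory = set.intersection(*field_sets)
    irrelevant = set.union(*field_sets) - mandatory
    return mandatory, irrelevant


# Field lists of the three example URLs of the same webpage.
field_lists = [
    ["cid=1"],
    ["cid=1", "sid=2"],
    ["cid=1", "sid=2", "from=systemmail"],
]
mandatory, irrelevant = classify_fields(field_lists)
```

Note that a field is compared as a whole "key=value" string here, which matches the worked example; whether the method compares keys alone is not stated in the text.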
Step S103: according to the field splitting and analysis results, remove the irrelevant fields and generate the unique URL of the webpage.
Finally, the irrelevant fields identified by field splitting and analysis are deleted from each URL, yielding a unique URL for the webpage.
In the example above, deleting the irrelevant fields yields the unique URL http://www.a.com/b?cid=1, which gives the record <docid, uniq_url>.
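Steps S101 to S103 can be combined into one short routine. This is a sketch under assumptions, not the patented implementation: the name `generate_unique_url` is hypothetical, all URLs in the set are assumed to share the same scheme, host and path, and the result is rebuilt from the first URL so that its field order is preserved.

```python
from urllib.parse import urlsplit


def generate_unique_url(urls):
    """Collapse the URLs of one webpage into a single URL.

    Keeps only the query fields that appear in every URL (mandatory
    fields) and drops the rest (irrelevant fields).
    """
    if not urls:
        raise ValueError("expected at least one URL")
    if len(urls) == 1:
        # Filtering is only needed when the data set holds multiple URLs.
        return urls[0]

    parsed = [urlsplit(u) for u in urls]
    field_lists = [[f for f in p.query.split("&") if f] for p in parsed]
    mandatory = set.intersection(*(set(fields) for fields in field_lists))

    # Rebuild from the first URL, keeping its mandatory fields in order.
    first = parsed[0]
    kept = [f for f in field_lists[0] if f in mandatory]
    base = f"{first.scheme}://{first.netloc}{first.path}"
    return f"{base}?{'&'.join(kept)}" if kept else base


urls = [
    "http://www.a.com/b?cid=1",
    "http://www.a.com/b?cid=1&sid=2",
    "http://www.a.com/b?cid=1&sid=2&from=systemmail",
]
uniq_url = generate_unique_url(urls)
print(uniq_url)  # http://www.a.com/b?cid=1
```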
By identifying the different URL forms of the same webpage and automatically filtering out the fields that are unrelated to the page content, this embodiment converts the various URLs of the same webpage into a unique URL, and the result can go live within a short time after automatic analysis. This effectively reduces the storage required for webpage data and improves the efficiency and accuracy of page-visit statistics. Compared with the traditional approach, it requires no manual intervention, covers a wide range of websites, and produces accurate results.
As shown in Fig. 2, a second embodiment of the present invention provides a webpage URL filtering method that, building on the first embodiment, further comprises after step S103:
Step S104: verify the generated unique URL of the webpage.
This embodiment differs from the first embodiment in that it also verifies the validity of the generated unique URL, which guards against field-filtering errors and improves the accuracy of webpage data statistics.
Specifically, the unique URL is verified as follows.
The webpage corresponding to the unique URL is opened in a browser and the returned page content is obtained. The content is then treated as a character string and passed to a preset hash algorithm to compute the web document number (docid) corresponding to that content. The preset hash algorithm may be MD5 or a similar algorithm.
The computed web document number is then compared with the original web document number of the unique URL. If the two are identical, the preceding field filtering was correct and the generated unique URL is valid, that is, it still reaches the original page; otherwise the field filtering was wrong and the generated unique URL is judged invalid.
Verifying the validity of the generated unique URL in this way rules out field-filtering errors and thereby improves the accuracy of webpage data statistics.
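The verification step can be sketched with Python's `hashlib.md5`. Several things here are assumptions made for illustration: the names `compute_docid` and `is_unique_url_valid`, the injected `fetch_page` callable standing in for "opening the page in a browser", and the use of the 32-character MD5 hex digest as the docid. The text only says the docid is derived from the page content by a preset hash such as MD5, so the real docid format may differ.

```python
import hashlib


def compute_docid(page_content: str) -> str:
    """Hash the page content into a document number (here: MD5 hex digest)."""
    return hashlib.md5(page_content.encode("utf-8")).hexdigest()


def is_unique_url_valid(original_docid, unique_url, fetch_page):
    """Fetch the unique URL and check that its content hashes to the original docid."""
    return compute_docid(fetch_page(unique_url)) == original_docid


# Hypothetical stand-in for a real page fetch, so the sketch runs offline.
pages = {"http://www.a.com/b?cid=1": "<html>article body</html>"}


def fake_fetch(url):
    return pages.get(url, "")


original_docid = compute_docid("<html>article body</html>")
valid = is_unique_url_valid(original_docid, "http://www.a.com/b?cid=1", fake_fetch)
invalid = is_unique_url_valid(original_docid, "http://www.a.com/b", fake_fetch)
```

Passing the fetcher in as a parameter keeps the comparison logic testable without network access. Pages with dynamic content would hash differently on every fetch, a case the text does not address.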
As shown in Fig. 3, a third embodiment of the present invention provides a webpage URL filtering method that, building on the second embodiment, further comprises after step S104:
Step S105: when verification judges the unique URL to be invalid, report a field-filtering error message to the data monitoring platform.
This embodiment differs from the second embodiment in that it also specifies what happens after verification judges the unique URL invalid.
Specifically, when verification judges the generated unique URL invalid, a field-filtering error message is reported to the data monitoring platform so that corrective measures can be taken, for example performing field splitting and analysis on the URLs again, or checking the URL data provided by the browser back end.
Reporting field-filtering errors to the data monitoring platform in this way further improves the accuracy of webpage data statistics.
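The reporting step can be sketched as follows. The text does not describe the data monitoring platform's interface, so the `MonitoringPlatform` class, its `report` method and the message fields are all hypothetical; an in-memory list stands in for whatever transport a real platform would use.

```python
from dataclasses import dataclass, field


@dataclass
class MonitoringPlatform:
    """Hypothetical stand-in for the data monitoring platform."""
    errors: list = field(default_factory=list)

    def report(self, docid, unique_url):
        # Record a field-filtering error for later inspection.
        self.errors.append(
            {"type": "field_filter_error", "docid": docid, "unique_url": unique_url}
        )


def check_and_report(docid, unique_url, is_valid, platform):
    """Report to the platform only when verification judged the URL invalid."""
    if not is_valid:
        platform.report(docid, unique_url)
    return is_valid


platform = MonitoringPlatform()
check_and_report("doc-1", "http://www.a.com/b?cid=1", True, platform)
check_and_report("doc-2", "http://www.a.com/c", False, platform)
```

After these two calls only the second webpage is recorded, and an operator could use that record to re-run field splitting and analysis or to check the back-end URL data.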
As shown in Fig. 4, a first embodiment of the present invention provides a device for filtering webpage URLs, comprising a URL data acquisition module 201, a field splitting and analysis module 202 and a generation module 203, wherein:
the URL data acquisition module 201 is configured to obtain a pre-collected URL data set of the same webpage;
the field splitting and analysis module 202 is configured to perform field splitting and analysis on each URL in the URL data set when the URL data set contains multiple URLs; and
the generation module 203 is configured to remove the irrelevant fields according to the field splitting and analysis results and generate the unique URL of the webpage.
This embodiment automatically identifies and filters redundant parameters in webpage URLs and reduces the various URLs of the same webpage to their original form, which improves the accuracy of page-visit statistics and reduces the storage required for webpage data.
Specifically, the URL data acquisition module 201 first obtains the pre-collected URL data set of the same webpage. In one implementation, the URL data that the browser back end has collected for the webpage is obtained from the browser back end by web document number (docid). Each record can be expressed as <docid, url>, where docid is the document number of the webpage, a 64-character string that uniquely identifies the webpage.
Thus every URL of the webpage can be retrieved from the browser back end by the webpage's docid. This docid is also saved as the original docid of the webpage for use in the subsequent verification.
As noted above, the same webpage may have multiple URLs. When the URL data set obtained from the browser back end contains multiple URLs, the URLs of that webpage need to be filtered to remove the irrelevant fields.
To filter the URLs, the field splitting and analysis module 202 first performs field splitting and analysis on each URL of the webpage to identify the irrelevant fields in each URL under the same docid.
In one implementation, field splitting and analysis proceed as follows.
First, the separators in each URL of the URL data set are located, and each URL is split at the separators to obtain all of its fields.
URL fields are usually separated by symbols such as "&" and "?", so these separators can be used to split each URL into its fields.
For example, splitting http://tech.3g.cn/ipnewscontent.php?nid=2149583&sid=008A263A71D yields the fields "nid=2149583" and "sid=008A263A71D".
After splitting, the irrelevant fields are identified as follows.
Each field in each URL is analyzed. If a field appears in every URL, it is considered a mandatory field; if a field is absent from at least one URL, it is considered an irrelevant field, and deleting it does not affect access to the original webpage. For example, suppose a webpage has the following 3 URLs:
http://www.a.com/b?cid=1;
http://www.a.com/b?cid=1&sid=2;
http://www.a.com/b?cid=1&sid=2&from=systemmail;
After field splitting and analysis, the field "cid=1" appears in every URL and is therefore judged mandatory, whereas "sid=2" and "from=systemmail" are each absent from at least one URL and are therefore judged irrelevant.
Finally, the generation module 203 deletes the irrelevant fields identified by field splitting and analysis from each URL, yielding a unique URL for the webpage.
In the example above, deleting the irrelevant fields yields the unique URL http://www.a.com/b?cid=1, which gives the record <docid, uniq_url>.
By identifying the different URL forms of the same webpage and automatically filtering out the fields that are unrelated to the page content, this embodiment converts the various URLs of the same webpage into a unique URL, and the result can go live within a short time after automatic analysis. This effectively reduces the storage required for webpage data and improves the efficiency and accuracy of page-visit statistics. Compared with the traditional approach, it requires no manual intervention, covers a wide range of websites, and produces accurate results.
As shown in Fig. 5, the field splitting and analysis module 202 may comprise a separator acquisition unit 2021, a splitting unit 2022 and an analysis and judgment unit 2023, wherein:
the separator acquisition unit 2021 is configured to locate the separators in each URL of the URL data set;
the splitting unit 2022 is configured to split each URL at the separators to obtain all fields in each URL; and
the analysis and judgment unit 2023 is configured to analyze each field in each URL, determining that a field is mandatory when it appears in every URL of the URL data set, and that a field is irrelevant when it is absent from at least one URL of the URL data set.
As shown in Fig. 6, a second embodiment of the present invention provides a device for filtering webpage URLs that, building on the first embodiment, further comprises:
a verification module 203, configured to verify the generated unique URL of the webpage.
Specifically, as shown in Fig. 7, the verification module 203 may comprise a page content acquisition unit 2031, a computing unit 2032 and a comparison and judgment unit 2033, wherein:
the page content acquisition unit 2031 is configured to navigate to the unique URL and obtain the returned page content;
the computing unit 2032 is configured to compute, from the page content and using a hash algorithm, the web document number corresponding to the page content; and
the comparison and judgment unit 2033 is configured to compare the computed web document number with the original web document number of the unique URL, judging the unique URL valid if the two are identical and invalid otherwise.
This embodiment differs from the first embodiment in that it also verifies the validity of the generated unique URL, which guards against field-filtering errors and improves the accuracy of webpage data statistics.
Specifically, the unique URL is verified as follows.
The webpage corresponding to the unique URL is opened in a browser and the returned page content is obtained. The content is then treated as a character string and passed to a preset hash algorithm to compute the web document number (docid) corresponding to that content. The preset hash algorithm may be MD5 or a similar algorithm.
The computed web document number is then compared with the original web document number of the unique URL. If the two are identical, the preceding field filtering was correct and the generated unique URL is valid, that is, it still reaches the original page; otherwise the field filtering was wrong and the generated unique URL is judged invalid.
Verifying the validity of the generated unique URL in this way rules out field-filtering errors and thereby improves the accuracy of webpage data statistics.
As shown in Fig. 8, a third embodiment of the present invention provides a device for filtering webpage URLs that, building on the second embodiment, further comprises:
a reporting module 204, configured to report a field-filtering error message to the data monitoring platform when the verification module judges the unique URL to be invalid.
This embodiment differs from the second embodiment in that it also specifies what happens after verification judges the unique URL invalid.
Specifically, when verification judges the generated unique URL invalid, a field-filtering error message is reported to the data monitoring platform so that corrective measures can be taken, for example performing field splitting and analysis on the URLs again, or checking the URL data provided by the browser back end.
Reporting field-filtering errors to the data monitoring platform in this way further improves the accuracy of webpage data statistics.
As shown in Fig. 9, a preferred embodiment of the present invention provides a system for filtering webpage URLs, comprising a browser 301 and a data monitoring platform 302 communicatively connected to the browser 301, wherein:
the browser 301 may comprise the device described in the embodiments above; and
the data monitoring platform 302 is configured to receive the field-filtering error message that the browser 301 reports when it judges the generated unique URL to be invalid.
Specifically, this embodiment automatically identifies and filters redundant parameters in webpage URLs and reduces the various URLs of the same webpage to their original form, which improves the accuracy of page-visit statistics and reduces the storage required for webpage data.
The browser 301 first obtains the pre-collected URL data set of the same webpage. In one implementation, the URL data that the back end of the browser 301 has collected for the webpage is obtained from that back end by web document number (docid). Each record can be expressed as <docid, url>, where docid is the document number of the webpage, a 64-character string that uniquely identifies the webpage.
Thus every URL of the webpage can be retrieved from the back end of the browser 301 by the webpage's docid. This docid is also saved as the original docid of the webpage for use in the subsequent verification.
As noted above, the same webpage may have multiple URLs. When the URL data set obtained from the back end of the browser 301 contains multiple URLs, the URLs of that webpage need to be filtered to remove the irrelevant fields.
To filter the URLs, field splitting and analysis are first performed on each URL of the webpage to identify the irrelevant fields in each URL under the same docid.
In one implementation, field splitting and analysis proceed as follows.
First, the separators in each URL of the URL data set are located, and each URL is split at the separators to obtain all of its fields.
URL fields are usually separated by symbols such as "&" and "?", so these separators can be used to split each URL into its fields.
For example, splitting http://tech.3g.cn/ipnewscontent.php?nid=2149583&sid=008A263A71D yields the fields "nid=2149583" and "sid=008A263A71D".
After splitting, the irrelevant fields are identified as follows.
Each field in each URL is analyzed. If a field appears in every URL, it is considered a mandatory field; if a field is absent from at least one URL, it is considered an irrelevant field, and deleting it does not affect access to the original webpage. For example, suppose a webpage has the following 3 URLs:
http://www.a.com/b?cid=1;
http://www.a.com/b?cid=1&sid=2;
http://www.a.com/b?cid=1&sid=2&from=systemmail;
After field splitting and analysis, the field "cid=1" appears in every URL and is therefore judged mandatory, whereas "sid=2" and "from=systemmail" are each absent from at least one URL and are therefore judged irrelevant.
Finally, the irrelevant fields identified by field splitting and analysis are deleted from each URL, yielding a unique URL for the webpage.
In the example above, deleting the irrelevant fields yields the unique URL http://www.a.com/b?cid=1, which gives the record <docid, uniq_url>.
By identifying the different URL forms of the same webpage and automatically filtering out the fields that are unrelated to the page content, this embodiment converts the various URLs of the same webpage into a unique URL, and the result can go live within a short time after automatic analysis. This effectively reduces the storage required for webpage data and improves the accuracy of page-visit statistics. Compared with the traditional approach, it requires no manual intervention, covers a wide range of websites, and produces accurate results.
Further, the browser 301 can also verify the generated unique URL of the webpage, as follows.
The webpage corresponding to the unique URL is opened in the browser 301 and the returned page content is obtained. The content is then treated as a character string and passed to a preset hash algorithm to compute the web document number (docid) corresponding to that content. The preset hash algorithm may be MD5 or a similar algorithm.
The computed web document number is then compared with the original web document number of the unique URL. If the two are identical, the preceding field filtering was correct and the generated unique URL is valid, that is, it still reaches the original page; otherwise the field filtering was wrong and the generated unique URL is judged invalid.
Verifying the validity of the generated unique URL in this way rules out field-filtering errors and thereby improves the accuracy of webpage data statistics.
Further, when verification judges the generated unique URL invalid, the browser 301 reports a field-filtering error message to the data monitoring platform 302 so that corrective measures can be taken, for example performing field splitting and analysis on the URLs again, or checking the URL data provided by the back end of the browser 301.
Reporting field-filtering errors to the data monitoring platform 302 in this way further improves the accuracy of webpage data statistics.
It should be noted that, in this article, term " comprises ", " comprising " or its any other variant are intended to contain comprising of nonexcludability, thus make to comprise the process of a series of key element, method, article or device and not only comprise those key elements, but also comprise other key elements clearly do not listed, or also comprise by the intrinsic key element of this process, method, article or device.When not more restrictions, the key element limited by statement " comprising ... ", and be not precluded within process, method, article or the device comprising this key element and also there is other identical element.
The invention described above embodiment sequence number, just to describing, does not represent the quality of embodiment.
Through the above description of the embodiments, those skilled in the art can be well understood to the mode that above-described embodiment method can add required general hardware platform by software and realize, hardware can certainly be passed through, but in a lot of situation, the former is better embodiment.Based on such understanding, technical scheme of the present invention can embody with the form of software product the part that prior art contributes in essence in other words, this computer software product is stored in a storage medium (as ROM/RAM, magnetic disc, CD), comprising some instructions in order to make a station terminal equipment (can be mobile phone, computing machine, server, or the network equipment etc.) perform method described in each embodiment of the present invention.Particularly, the programmed instruction corresponding to system of the device of the filtering web page URL described in Fig. 4-Fig. 8 and the filtering web page URL described in Fig. 9 can be stored in the readable storage medium storing program for executing of computing machine, server and other-end, and performed by least one processor wherein, to realize the webpage url filtering method described in Fig. 1 to Fig. 3.
The foregoing describes only preferred embodiments of the present invention and does not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (12)

1. A webpage URL filtering method, characterized by comprising:
obtaining a pre-collected URL data set of the same webpage;
when the URL data set contains multiple URLs, performing field splitting and analysis on each URL in the URL data set;
removing the irrelevant fields according to the field splitting and analysis results, and generating a unique URL of the webpage.
2. The method according to claim 1, characterized in that the step of obtaining the pre-collected URL data set of the same webpage comprises:
obtaining, from a browser backend and based on a web page document number, the URL data of the same webpage collected by the browser backend.
3. The method according to claim 1, characterized in that the step of performing field splitting and analysis on each URL in the URL data set comprises:
for each URL in the URL data set, obtaining the separators it contains;
splitting each URL into fields at the separators to obtain all fields in each URL;
analyzing each field in each URL: when the field appears in every URL in the URL data set, determining that it is a mandatory field; when the field is absent from at least one URL in the URL data set, determining that it is an irrelevant field.
4. The method according to claim 2 or 3, characterized by further comprising:
verifying the generated unique URL of the webpage.
5. The method according to claim 4, characterized in that the step of verifying the generated unique URL of the webpage comprises:
navigating to the unique URL and obtaining the returned web page content;
calculating, from the web page content and based on a hash algorithm, the web page document number corresponding to the web page content;
comparing the web page document number obtained for the web page content with the original web page document number of the unique URL; if the two are identical, determining that the unique URL is valid, and otherwise determining that the unique URL is invalid.
6. The method according to claim 5, characterized by further comprising:
when the unique URL is determined to be invalid, reporting a field filtering error message to a data monitoring platform.
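By way of illustration only, the field splitting and analysis recited in claims 1 and 3 might be sketched as follows. The sketch assumes that the separators are the "?" and "&" of the query string, that a field is a name=value pair, and that all URLs in the data set share the same scheme, host, and path; none of these assumptions is stated in the claims, and the function names are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def split_fields(url):
    """Split a URL's query string into ordered (name, value) fields."""
    return parse_qsl(urlsplit(url).query, keep_blank_values=True)


def classify_fields(urls):
    """Return (mandatory, irrelevant) field sets for URLs of one webpage.

    A field present in every URL is mandatory; a field missing from at
    least one URL is irrelevant.
    """
    field_sets = [set(split_fields(u)) for u in urls]
    all_fields = set().union(*field_sets)
    mandatory = set.intersection(*field_sets)
    return mandatory, all_fields - mandatory


def unique_url(urls):
    """Build one URL that keeps only the fields shared by every variant."""
    if len(urls) == 1:
        # A single URL needs no filtering.
        return urls[0]
    mandatory, _ = classify_fields(urls)
    base = urlsplit(urls[0])
    # Keep the mandatory fields in the order they appear in the first URL.
    kept = [f for f in split_fields(urls[0]) if f in mandatory]
    # Assumption: the fragment is dropped from the generated URL.
    return urlunsplit((base.scheme, base.netloc, base.path, urlencode(kept), ""))
```

For example, given the three variants `http://example.com/article?id=42&from=weibo`, `http://example.com/article?id=42&sid=abc123`, and `http://example.com/article?id=42`, the only field common to all of them is `id=42`, so `unique_url` returns `http://example.com/article?id=42`.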
7. A webpage URL filtering device, characterized by comprising:
a URL data acquisition module, configured to obtain a pre-collected URL data set of the same webpage;
a field splitting and analysis module, configured to perform field splitting and analysis on each URL in the URL data set when the URL data set contains multiple URLs;
a generation module, configured to remove the irrelevant fields according to the field splitting and analysis results and to generate a unique URL of the webpage.
8. The device according to claim 7, characterized in that
the URL data acquisition module is further configured to obtain, from a browser backend and based on a web page document number, the URL data of the same webpage collected by the browser backend.
9. The device according to claim 8, characterized in that the field splitting and analysis module comprises:
a separator acquisition unit, configured to obtain, for each URL in the URL data set, the separators it contains;
a splitting unit, configured to split each URL into fields at the separators to obtain all fields in each URL;
an analysis and determination unit, configured to analyze each field in each URL: when the field appears in every URL in the URL data set, to determine that it is a mandatory field; when the field is absent from at least one URL in the URL data set, to determine that it is an irrelevant field.
10. The device according to claim 8 or 9, characterized by further comprising:
a verification module, configured to verify the generated unique URL of the webpage; the verification module specifically comprises:
a web page content acquisition unit, configured to navigate to the unique URL and obtain the returned web page content;
a calculation unit, configured to calculate, from the web page content and based on a hash algorithm, the web page document number corresponding to the web page content;
a comparison and determination unit, configured to compare the web page document number obtained for the web page content with the original web page document number of the unique URL; if the two are identical, to determine that the unique URL is valid, and otherwise to determine that the unique URL is invalid.
11. The device according to claim 10, characterized by further comprising:
a reporting module, configured to report a field filtering error message to a data monitoring platform when the verification module determines that the unique URL is invalid.
12. A webpage URL filtering system, characterized by comprising a browser and a data monitoring platform communicatively connected to the browser, wherein:
the browser comprises the device according to any one of claims 7 to 11;
the data monitoring platform is configured to receive the field filtering error message reported when the browser determines that the generated unique URL is invalid.
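By way of illustration only, the verification and reporting recited in claims 5, 6, 10 and 11 might be sketched as follows. The claims do not name the hash algorithm or the reporting interface, so SHA-256, the `report` callback, and the shape of the error message are all assumptions made for this sketch; the function names are hypothetical.

```python
import hashlib
from urllib.request import urlopen


def fetch(url):
    """Request the URL and return the response body as bytes."""
    with urlopen(url, timeout=10) as resp:
        return resp.read()


def document_number(content):
    """Derive a document number from page content.

    SHA-256 is an assumption; the claims only say "a hash algorithm".
    """
    return hashlib.sha256(content).hexdigest()


def verify_unique_url(url, original_number, fetch=fetch, report=None):
    """Return True if the page behind `url` hashes to `original_number`."""
    valid = document_number(fetch(url)) == original_number
    if not valid and report is not None:
        # Hypothetical hook standing in for the data monitoring platform.
        report({"url": url, "error": "field filtering error"})
    return valid
```

The `fetch` parameter is injectable so that the comparison logic can be exercised without network access. A byte-for-byte hash only matches when the filtered URL returns exactly the same content as the original, so a page with dynamic content would need a more tolerant document number than this sketch uses.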
CN201310547585.4A 2013-11-06 2013-11-06 Webpage URL filtering method, device and system Pending CN104636340A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310547585.4A CN104636340A (en) 2013-11-06 2013-11-06 Webpage URL filtering method, device and system

Publications (1)

Publication Number Publication Date
CN104636340A true CN104636340A (en) 2015-05-20

Family

ID=53215112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310547585.4A Pending CN104636340A (en) 2013-11-06 2013-11-06 Webpage URL filtering method, device and system

Country Status (1)

Country Link
CN (1) CN104636340A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106227741A (en) * 2016-07-12 2016-12-14 国家计算机网络与信息安全管理中心 Large-scale URL matching method based on a multilevel hash index linked list
CN106412054A (en) * 2016-09-27 2017-02-15 网宿科技股份有限公司 Naming method and system for converting dynamic network addresses into static network addresses, and application thereof
CN106919570A (en) * 2015-12-24 2017-07-04 国家新闻出版广电总局广播科学研究院 Page link deduplication scanning method and device for network new media
CN106953937A (en) * 2016-11-16 2017-07-14 阿里巴巴集团控股有限公司 Uniform resource locator (URL) conversion method and device
CN106970917A (en) * 2016-01-13 2017-07-21 中国科学院声学研究所 Method for building a hash table of blacklisted URLs and looking up a requested URL

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060218143A1 (en) * 2005-03-25 2006-09-28 Microsoft Corporation Systems and methods for inferring uniform resource locator (URL) normalization rules
CN102110132A (en) * 2010-12-08 2011-06-29 北京星网锐捷网络技术有限公司 Uniform resource locator matching and searching method, device and network equipment
CN102663105A (en) * 2012-04-13 2012-09-12 北京搜狗科技发展有限公司 Establishing method and system of number information database
CN102682085A (en) * 2012-04-18 2012-09-19 北京十分科技有限公司 Method for removing duplicated web page

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106919570A (en) * 2015-12-24 2017-07-04 国家新闻出版广电总局广播科学研究院 Page link deduplication scanning method and device for network new media
CN106970917A (en) * 2016-01-13 2017-07-21 中国科学院声学研究所 Method for building a hash table of blacklisted URLs and looking up a requested URL
CN106970917B (en) * 2016-01-13 2019-11-19 中国科学院声学研究所 Method for building a hash table of blacklisted URLs and looking up a requested URL
CN106227741A (en) * 2016-07-12 2016-12-14 国家计算机网络与信息安全管理中心 Large-scale URL matching method based on a multilevel hash index linked list
CN106227741B (en) * 2016-07-12 2019-08-30 国家计算机网络与信息安全管理中心 Large-scale URL matching method based on a multilevel hash index linked list
CN106412054A (en) * 2016-09-27 2017-02-15 网宿科技股份有限公司 Naming method and system for converting dynamic network addresses into static network addresses, and application thereof
CN106412054B (en) * 2016-09-27 2019-05-24 网宿科技股份有限公司 Naming method and system for converting dynamic network addresses into static network addresses, and application thereof
CN106953937A (en) * 2016-11-16 2017-07-14 阿里巴巴集团控股有限公司 Uniform resource locator (URL) conversion method and device
CN106953937B (en) * 2016-11-16 2020-06-02 阿里巴巴集团控股有限公司 Uniform resource locator (URL) conversion method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150520