CN105302815A - Web page uniform resource locator URL filtering method and apparatus - Google Patents

Publication number
CN105302815A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410284750.6A
Other languages
Chinese (zh)
Other versions
CN105302815B (en)
Inventor
何双宁
董昭
马杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201410284750.6A
Publication of CN105302815A
Application granted
Publication of CN105302815B
Legal status: Active

Abstract

The present invention discloses a web page uniform resource locator URL filtering method and apparatus. The method comprises: obtaining a to-be-processed URL set, wherein the to-be-processed URL set comprises a plurality of to-be-processed web page URLs; and performing the following filtering operation on each URL in the to-be-processed URL set, wherein a URL currently subjected to the filtering operation in the to-be-processed URL set is a current URL: determining whether the current URL is a to-be-tested URL according to a filtering identifier in a preset configuration file; if the URL is a to-be-tested URL, matching the current URL according to a filtering field in the configuration file; and if the current URL is successfully matched according to the filtering field, filtering the current URL out of the to-be-processed URL set. The method and apparatus provided by the present invention solve a technical problem in the prior art that URLs of spam web pages cannot be filtered, so that Web security scanning is performed after URLs of spam web pages are filtered out, thereby improving the efficiency of Web security scanning.

Description

Method and apparatus for filtering uniform resource locators (URLs) of web pages
Technical field
The present invention relates to the computer field, and in particular to a method and an apparatus for filtering the uniform resource locators (URLs) of web pages.
Background
When Web security scanning is performed on Common Gateway Interface (CGI) endpoints, it is usually necessary to collect as many CGIs as possible and to filter out the spam pages among them, so as to improve the efficiency of the scan. At present, those skilled in the art mainly collect CGIs in two ways: first, by crawling URLs on the Internet with a web crawler; second, by obtaining CGIs from the traffic of a bypass WAF. Both methods, however, inevitably collect a large number of spam pages. A spam page may be a web page that cannot be accessed or that does not exist. Such pages are meaningless for Web security scanning and can greatly reduce its efficiency.
As the number of collected CGIs keeps growing, the number of spam pages collected by these methods grows with it. It therefore becomes very important, during Web security scanning, to identify spam pages quickly among a massive number of URLs and to filter out the URLs corresponding to them.
However, no effective solution to this problem has yet been proposed.
Summary of the invention
Embodiments of the present invention provide a method and an apparatus for filtering the URLs of web pages, so as to at least solve the technical problem that the prior art cannot filter out the URLs of spam web pages.
According to one aspect of the embodiments of the present invention, a method for filtering the URLs of web pages is provided, comprising: obtaining a to-be-processed URL set, wherein the to-be-processed URL set comprises the URLs of a plurality of to-be-processed web pages; and performing the following filtering operation on each URL in the to-be-processed URL set, wherein the URL currently subjected to the filtering operation is the current URL: determining, according to a filtering identifier in a preset configuration file, whether the current URL is a to-be-tested URL; if the current URL is a to-be-tested URL, matching the current URL according to a filtering field in the configuration file; and if the current URL is successfully matched according to the filtering field, filtering the current URL out of the to-be-processed URL set.
According to another aspect of the embodiments of the present invention, an apparatus for filtering the URLs of web pages is further provided, comprising: an obtaining unit, configured to obtain a to-be-processed URL set, wherein the to-be-processed URL set comprises the URLs of a plurality of to-be-processed web pages; and a filtering unit, configured to perform the following filtering operation on each URL in the to-be-processed URL set, wherein the URL currently subjected to the filtering operation is the current URL: determining, according to a filtering identifier in a preset configuration file, whether the current URL is a to-be-tested URL; when the current URL is a to-be-tested URL, matching the current URL according to a filtering field in the configuration file; and when the current URL is successfully matched according to the filtering field, filtering the current URL out of the to-be-processed URL set.
In the embodiments of the present invention, the obtained to-be-processed URLs are filtered by means of a configuration file that comprises at least a filtering identifier and a filtering field. The filtering identifier is used to determine whether a to-be-processed URL is a to-be-tested URL, which provides a preliminary screening of the URLs. The to-be-tested URLs are then matched against the filtering field, and the successfully matched URLs are filtered out. As a result, the URLs corresponding to unnecessary spam pages are no longer scanned during Web security scanning, which improves the efficiency of the scan and solves the technical problem that the prior art cannot filter out the URLs of spam web pages.
In addition, the to-be-tested URLs are matched, in a predetermined matching mode, against the feature parameters and/or feature strings in the filtering field. This allows URLs to be filtered accurately and thereby improves the accuracy of filtering the URLs of web pages.
Brief description of the drawings
The drawings described herein provide a further understanding of the present invention and form a part of this application. The exemplary embodiments of the present invention and their description are used to explain the present invention and do not constitute an improper limitation on it. In the drawings:
Fig. 1 is a schematic diagram of a hardware environment to which an optional method for filtering the URLs of web pages according to an embodiment of the present invention is applied;
Fig. 2 is a flowchart of an optional method for filtering the URLs of web pages according to an embodiment of the present invention;
Fig. 3 is a flowchart of an optional method for obtaining the URLs of web pages according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a configuration file in an optional method for filtering the URLs of web pages according to an embodiment of the present invention;
Fig. 5 is a flowchart of another optional method for filtering the URLs of web pages according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of a configuration file in another optional method for filtering the URLs of web pages according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of an optional apparatus for filtering the URLs of web pages according to an embodiment of the present invention; and
Fig. 8 is a schematic diagram of a server to which an optional method for filtering the URLs of web pages according to an embodiment of the present invention is applied.
Detailed description of the embodiments
To help those skilled in the art better understand the solutions of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", and the like in the specification, the claims, and the accompanying drawings are used to distinguish between similar objects and do not necessarily describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than the one illustrated or described here. In addition, the terms "comprise" and "have" and any variants thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may comprise other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
Embodiment 1
According to an embodiment of the present invention, a method for filtering the URLs of web pages is provided. The method can be applied in the hardware environment shown in Fig. 1, in which a filtering server 102, which filters the URLs of web pages, establishes a link over a network with the web page server 104 where the web pages are located, and filters the to-be-processed URLs sent by the web page server 104. The network includes, but is not limited to, a wide area network, a metropolitan area network, or a local area network.
Optionally, as shown in Fig. 2, the method for filtering the URLs of web pages in this embodiment comprises:
S202, obtain a to-be-processed URL set, wherein the to-be-processed URL set comprises the URLs of a plurality of to-be-processed web pages;
S204, perform the following filtering operation on each URL in the to-be-processed URL set, wherein the URL currently subjected to the filtering operation is the current URL:
S2042, determine, according to a filtering identifier in a preset configuration file, whether the current URL is a to-be-tested URL;
S2044, if the current URL is a to-be-tested URL, match the current URL according to a filtering field in the configuration file;
S2046, if the current URL is successfully matched according to the filtering field, filter the current URL out of the to-be-processed URL set;
S2048, if the current URL is not a to-be-tested URL, or if the current URL is not successfully matched according to the filtering field, do not filter the current URL out of the to-be-processed URL set.
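Steps S202 to S2048 can be sketched as a simple loop. The following is only a minimal illustration, not the patented implementation: the function name filter_url_set, the two callables, and the example URLs are hypothetical and stand in for the checks driven by the configuration file.

```python
from typing import Callable, Iterable, List


def filter_url_set(
    urls: Iterable[str],
    is_to_be_tested: Callable[[str], bool],
    matches_filtering_field: Callable[[str], bool],
) -> List[str]:
    """Return the URLs that remain after the filtering operation (S204)."""
    remaining = []
    for current_url in urls:
        # S2042: decide from the filtering identifier whether to test this URL.
        # S2044: only a to-be-tested URL is matched against the filtering field.
        if is_to_be_tested(current_url) and matches_filtering_field(current_url):
            # S2046: successfully matched, so the URL is filtered out.
            continue
        # S2048: not to be tested, or not matched, so the URL is kept.
        remaining.append(current_url)
    return remaining


# Hypothetical stand-ins for the two configuration-driven checks.
urls = [
    "http://a.example/ok",
    "http://b.example/missing",
    "http://c.example/missing",
]
kept = filter_url_set(
    urls,
    is_to_be_tested=lambda u: "b.example" in u,
    matches_filtering_field=lambda u: u.endswith("/missing"),
)
print(kept)
```

In this sketch only the URL on b.example is both in scope and matched, so it is the only one removed; the URL on c.example is kept because it never becomes a to-be-tested URL.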
Optionally, in this embodiment, the method can be applied in the process of Web security scanning. For example, with reference to Fig. 1, before the Web security scan is performed, the to-be-processed URL set, which comprises the URLs of a plurality of to-be-processed web pages, is obtained, and the filtering operation is performed on each URL in the set, so that the URLs corresponding to spam pages that do not need to be scanned are filtered out of the massive number of URLs obtained by the filtering server 102. The above is only an example, and this embodiment is not limited to it.
Optionally, in this embodiment, with reference to Fig. 3, the interaction between the filtering server 102 and the web page server 104 before the to-be-processed URL set is obtained is as follows:
S302, the filtering server 102 sends, over the network, a request for the to-be-processed URL set to the web page server 104;
S304, in response to the request, the web page server 104 returns the to-be-processed URL set to the filtering server 102.
Optionally, in this embodiment, the configuration file is a file formed by a JSON string that contains the filtering identifier and the filtering field. JSON (JavaScript Object Notation) is a lightweight, text-based data interchange format that is easy for humans to read and for machines to parse and generate. The filtering identifier may include, but is not limited to, the scope to which filtering of the to-be-processed URL set applies. For example, the scope may include, but is not limited to, all web pages or a subset of web pages, where the subset can be selected by a preset domain name. The filtering field may include, but is not limited to, the matching result that indicates that a to-be-tested URL is to be filtered, and may contain, but is not limited to, multiple filtering subfields. For example, the matching result may include, but is not limited to: a feature parameter used for matching and its matching mode, and a feature string used for matching and its matching mode.
For example, as shown at 402 in Fig. 4, the filtering identifier is denoted by "host". When the value of "host" is "*", the filtering applies to all web pages; when the value of "host" is "domain name/IP", the filtering applies to the web pages corresponding to that "domain name/IP". When the current URL satisfies the filtering identifier, the current URL is determined to be a to-be-tested URL.
As another example, as shown at 404 in Fig. 4, the filtering field is denoted by "rule", which may include, but is not limited to, the following subfields: 1) a feature parameter set on the status code "HttpCode"; 2) a feature string set on the message body "Content". For example, the configuration file specifies that the value of the status code "HttpCode" equals 200 and that the string for the message body "Content" is "http://qzone.qq.com/gy/404/data.js". When the to-be-tested URL successfully matches all subfields of the filtering field, the to-be-tested URL is determined to be successfully matched, and the current URL is filtered out of the to-be-processed URL set.
Optionally, in this embodiment, the configuration file may further include, but is not limited to, a type name of the configuration file and attributes of the configuration file, where the attributes may include, but are not limited to, the time at which the configuration file was added and the person who added it. For example, as shown at 406 in Fig. 4, the type name of the configuration file is "gongyi404"; as shown at 408 in Fig. 4, the configuration file was added on "2013-10-13" by "zhangsan".
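Because Fig. 4 is not reproduced in the text, the exact layout of the configuration file is unknown. The sketch below shows one possible JSON encoding that carries the values mentioned above. The keys "host", "rule", "HttpCode", and "Content" come from the description; the keys "name", "add_time", and "adder" and the [mode, value] pair layout are assumptions made for illustration only.

```python
import json

# One possible JSON encoding of the configuration described for Fig. 4.
# Assumed (not from the source): the keys "name", "add_time", "adder",
# and the [matching mode, value] layout inside "rule".
CONFIG_JSON = """
{
  "name": "gongyi404",
  "host": "*",
  "rule": {
    "HttpCode": ["=", 200],
    "Content": ["substr", "http://qzone.qq.com/gy/404/data.js"]
  },
  "add_time": "2013-10-13",
  "adder": "zhangsan"
}
"""

config = json.loads(CONFIG_JSON)
filtering_identifier = config["host"]  # scope of the filtering
filtering_field = config["rule"]       # subfields that must all match

print(filtering_identifier)
print(sorted(filtering_field))
```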
Optionally, in this embodiment, after the to-be-processed URLs have been filtered, the URLs that remain once the URLs corresponding to spam pages have been filtered out are saved, so that they can be called directly during Web security scanning, which improves the efficiency of the scan.
Optionally, in this embodiment, the configuration files may be stored in the form of a hash table, in at least one of the following locations: a text file on disk, or a file on a database server. Optionally, the configuration files are loaded from that location only when the to-be-processed URLs need to be filtered. Optionally, in this embodiment, the configuration files may be loaded by, but not limited to, traversing the hash table to look up the configuration file corresponding to the current URL.
Optionally, the preset configuration file is a plurality of configuration files. In this case, the steps of determining whether the current URL is a to-be-tested URL according to the filtering identifier, matching the current URL according to the filtering field, and filtering the current URL out of the to-be-processed URL set are performed as follows: a matching configuration file is searched for among the plurality of configuration files, where a matching configuration file is one whose filtering identifier determines the current URL to be a to-be-tested URL and whose filtering field successfully matches the current URL; as soon as one matching configuration file is found among the plurality of configuration files, the current URL is filtered out of the to-be-processed URL set.
The filtering process is described below with reference to Fig. 1 and the flowchart in Fig. 5. Assume that the URL set comprises URL_1, URL_2, URL_3, URL_4, and URL_5, and that the configuration files are P_1, P_2, and P_3.
As shown in Fig. 1 and Fig. 5, the method for filtering the URLs of web pages in this embodiment comprises:
S502, the filtering server 102 requests the to-be-processed URL set from the web page server 104;
S504, the web page server 104 returns the to-be-processed URL set to the filtering server 102;
S506, the filtering server 102 determines whether the filtering operation has been performed on all URLs in the URL set (for example, URL_1, URL_2, and URL_3). If some URL in the URL set has not yet undergone the filtering operation, step S508 is performed; if the filtering operation has been performed on every URL in the URL set, this round of filtering ends;
S508, the filtering server 102 reads one URL (for example, URL_3) on which the filtering operation has not yet been performed;
S510, the filtering server 102 determines whether the matching operation has been performed against all configuration files. If there is a configuration file against which the matching operation has not yet been performed, step S512 is performed; if the matching operation (S514, or S514 and S516) has been performed against all configuration files, the process returns to S506. When the filtering server 102 makes this determination for the first time, it first loads all configuration files (for example, configuration files P_1, P_2, and P_3);
S512, the filtering server 102 reads one configuration file (for example, configuration file P_2) against which the matching operation has not yet been performed;
S514, the filtering server 102 determines whether the current URL satisfies the condition indicated by the filtering identifier. If so, step S516 is performed; otherwise, the process returns to S510 to look for a new configuration file;
S516, the filtering server 102 matches the current URL (for example, URL_3) against the subfields of the filtering field in turn;
S518, the filtering server 102 determines whether the match succeeded. If it succeeded, the current URL (for example, URL_3) is determined to be a URL corresponding to a spam page, and step S520 is performed; if it did not succeed, the process returns to S510 to determine whether the matching operation has been performed against all configuration files (for example, configuration file P_3 has not yet been matched);
S520, the filtering server 102 filters out the URL corresponding to the spam page and returns to step S506 to determine whether the filtering operation has been performed on all URLs in the URL set.
As can be seen from step S518, if the current URL (for example, URL_3) does not match configuration file P_2, the configuration file currently being matched, the process returns to look for a new configuration file, for example by reading from the plurality of configuration files one that has not yet been matched, such as configuration file P_3. If the current URL (for example, URL_3) matches configuration file P_2, that is, configuration file P_2 is the matching configuration file that has been found, step S520 is performed and the current URL (for example, URL_3), which has been determined to correspond to a spam page, is filtered out. The process then returns to step S506 to determine whether the filtering operation has been performed on all URLs in the URL set.
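The control flow of S506 to S520 amounts to two nested loops with an early exit as soon as one matching configuration file is found. The sketch below illustrates this under simplifying assumptions: each configuration file is reduced to a hypothetical dictionary with "id", "host", and "marker" keys, and the two checks are passed in as callables rather than derived from real page fetches.

```python
from typing import Callable, Dict, List

Config = Dict[str, str]


def filter_with_configs(
    urls: List[str],
    configs: List[Config],
    in_scope: Callable[[str, Config], bool],
    rule_matches: Callable[[str, Config], bool],
) -> List[str]:
    """Keep every URL for which no matching configuration file is found."""
    remaining = []
    for current_url in urls:                       # S506 / S508
        is_spam = False
        for config in configs:                     # S510 / S512
            if not in_scope(current_url, config):  # S514: outside the scope
                continue
            if rule_matches(current_url, config):  # S516 / S518
                is_spam = True                     # S520: one match suffices
                break
        if not is_spam:
            remaining.append(current_url)
    return remaining


# Hypothetical stand-ins for configuration files P_1, P_2 and P_3.
configs = [
    {"id": "P_1", "host": "a.example", "marker": "/404"},
    {"id": "P_2", "host": "*", "marker": "/gone"},
    {"id": "P_3", "host": "b.example", "marker": "/err"},
]
urls = [
    "http://a.example/404",
    "http://a.example/index",
    "http://c.example/gone",
    "http://b.example/err",
    "http://b.example/home",
]
kept = filter_with_configs(
    urls,
    configs,
    in_scope=lambda u, c: c["host"] == "*" or c["host"] in u,
    rule_matches=lambda u, c: u.endswith(c["marker"]),
)
print(kept)
```

Breaking out of the inner loop mirrors the statement that the current URL is filtered out as soon as one matching configuration file is found, so the remaining configuration files need not be checked for that URL.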
In the embodiment provided by the present invention, the to-be-tested URLs are screened out of the massive to-be-processed URL set by the filtering identifier in the preset configuration file, and the to-be-tested URLs are then matched against the filtering field in the configuration file to obtain the URLs that need to be filtered out of the to-be-processed URL set. The URLs corresponding to unnecessary spam pages in the obtained URL set are thus filtered out effectively, so that they no longer need to be scanned during Web security scanning, which improves the efficiency of the scan.
As an optional solution, step S2042, determining, according to the filtering identifier in the preset configuration file, whether the current URL is a to-be-tested URL, comprises:
1) if the filtering identifier is a field indicating that all web pages are to be filtered, determining that the current URL is a to-be-tested URL; or
2) if the filtering identifier is a field indicating that a preset domain name is to be filtered, determining whether the current URL contains the preset domain name, and if the current URL contains the preset domain name, determining that the current URL is a to-be-tested URL.
This is described with reference to the following example. As shown at 402 in Fig. 4, the filtering identifier is denoted by "host". When the value of "host" is "*", the filtering applies to all web pages, so the URLs corresponding to all web pages are determined to be to-be-tested URLs, which are then used in the subsequent matching determination.
As a further example, in the configuration file shown in Fig. 6, when the value of "host" 602 is "www.sina.com/168.1.1.3", the filtering applies to the corresponding web pages of the Sina website. When the current URL contains the preset domain name, the current URL is determined to be a to-be-tested URL.
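The two cases of the filtering identifier can be sketched as follows. This is an illustration under assumptions, not the method as implemented: the function name is hypothetical, the "/" in a value such as "www.sina.com/168.1.1.3" is assumed to separate a domain name from an IP address, and "contains the preset domain name" is read narrowly as "the host part of the URL equals one of the preset entries".

```python
from urllib.parse import urlsplit


def is_to_be_tested(current_url: str, host_field: str) -> bool:
    """Check the current URL against the "host" filtering identifier."""
    # Case 1: "*" means the filtering applies to all web pages.
    if host_field == "*":
        return True
    # Case 2: a preset "domain name/IP" value. Splitting on "/" is an
    # assumption about how the two entries are separated.
    presets = [entry for entry in host_field.split("/") if entry]
    url_host = urlsplit(current_url).hostname or ""
    return url_host in presets


print(is_to_be_tested("http://www.sina.com/news/1.html", "www.sina.com/168.1.1.3"))
print(is_to_be_tested("http://other.example/", "www.sina.com/168.1.1.3"))
```

Comparing against the parsed host rather than the whole URL avoids accidental hits on a domain name that merely appears in a path or query string; a looser substring test would be closer to a literal reading of the description.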
Optionally, in this embodiment, if the current URL does not fall within the scope indicated by the filtering identifier, the current URL is considered not to be a URL corresponding to a spam page, and the filtering determination is not continued for it.
In the embodiment provided by the present invention, the to-be-processed URL set is screened by the filtering identifier to obtain the to-be-tested URLs within the corresponding scope, and the URLs corresponding to spam pages are then filtered out within that scope. The filtering operation is thus performed only on URLs within a preset scope, which improves the accuracy of URL filtering.
As an optional solution, step S2044, matching the current URL according to the filtering field in the configuration file, comprises:
S1, perform on the current URL the matching operation indicated in the filtering field;
S2, determine whether the current URL is successfully matched according to whether the result obtained by performing the matching operation satisfies the matching result indicated in the filtering field.
Optionally, in this embodiment, the matching operation indicated in the filtering field includes, but is not limited to, feature parameter matching and feature string matching. The feature parameter may include, but is not limited to, at least one of the following: the status code of the web page corresponding to the current URL, and a content length field that represents the size of the web page corresponding to the current URL. The feature string may include, but is not limited to, at least one of the following: the link of the current URL, and a partial string in the link of the current URL.
Optionally, in this embodiment, the feature parameter and/or the feature string and their corresponding matching modes can be set in the filtering field, so that the matching operation can be performed on the to-be-tested URL. Optionally, in this embodiment, the matching mode of the feature parameter may include, but is not limited to: greater than the value of the feature parameter, less than the value of the feature parameter, or equal to the value of the feature parameter. The matching mode of the feature string may include, but is not limited to: search matching and regular-expression matching.
For example, the current URL on which the filtering operation is being performed is matched against the matching result indicated in the filtering field. If the result obtained by performing the matching operation on the current URL satisfies the matching result indicated in the filtering field, the current URL is determined to be successfully matched.
In the embodiment provided by the present invention, the matching operation is performed on the to-be-tested URL by means of the filtering field to further determine whether the to-be-tested URL corresponds to an unnecessary spam page. The URLs corresponding to spam pages are thus identified accurately, which improves the efficiency and reduces the cost of Web security scanning.
As an optional solution, S1, performing on the current URL the matching operation indicated in the filtering field, comprises: S10, determining whether the feature parameter of the current URL and the feature parameter in the filtering field of the configuration file satisfy the matching relationship, wherein the feature parameter comprises the status code of the web page corresponding to the current URL and/or a content length field that represents the size of the web page corresponding to the current URL. S2, determining whether the current URL is successfully matched according to whether the result obtained by performing the matching operation satisfies the matching result indicated in the filtering field, comprises: S20, if the feature parameter of the current URL and the feature parameter in the filtering field of the configuration file satisfy the matching relationship, determining that the current URL is successfully matched.
This is described with reference to the following example. With reference to Fig. 6, assume that the matching operation indicated in the filtering field is feature parameter matching. It is then determined whether the feature parameter of the current URL and the feature parameter in the filtering field satisfy the predetermined matching relationship. For example, if the feature parameter in the filtering field is the status code "HttpCode" and the predetermined matching relationship is "=; 200", it is determined whether the feature parameter of the current URL (that is, the status code "HttpCode") equals 200. If the status code "HttpCode" of the current URL satisfies this matching relationship, the current URL is determined to be successfully matched.
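A comparison of a feature parameter against its configured value can be sketched as below. The function name and the ">" and "<" symbols are assumptions chosen to represent the "greater than" and "less than" matching modes; only "=" with the value 200 appears in the description. The operator and value are passed separately so that the sketch does not guess how they are serialized in the configuration file.

```python
import operator

# Matching modes for a feature parameter. Only "=" appears in the source;
# the ">" and "<" symbols are assumed labels for the other two modes.
COMPARATORS = {
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}


def parameter_matches(actual: int, mode: str, expected: int) -> bool:
    """Return True if the observed value satisfies the matching relationship."""
    return COMPARATORS[mode](actual, expected)


# Status code "HttpCode" observed for the page behind the current URL.
print(parameter_matches(200, "=", 200))
# A content length check, e.g. pages smaller than a hypothetical threshold.
print(parameter_matches(512, "<", 1024))
```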
As an optional solution, S1, performing on the current URL the matching operation indicated in the filtering field, comprises: S10, determining whether the feature string of the current URL and the feature string in the filtering field of the configuration file satisfy the matching condition, wherein the feature string comprises at least one of the following: the link of the current URL, and a partial string in the link of the current URL. S2, determining whether the current URL is successfully matched according to whether the result obtained by performing the matching operation satisfies the matching result indicated in the filtering field, comprises: S20, if the feature string of the current URL and the feature string in the filtering field of the configuration file satisfy the matching condition, determining that the current URL is successfully matched.
This is described with reference to the following example. With reference to Fig. 6, assume that the matching operation indicated in the filtering field is feature string matching. It is then determined whether the feature string of the current URL and the feature string in the filtering field satisfy the predetermined matching condition. For example, if the feature string in the filtering field is the message body "Content" and the predetermined matching condition is "substr=; stc="http://news.sina.com/gj/303/data.js"", it is determined whether the feature string of the current URL (that is, the message body "Content") satisfies this matching condition, for example the link string indicating international news in Sina News shown in Fig. 6. If the feature string is found in the current URL, the current URL is determined to be successfully matched.
As another example, a regular-expression matching condition can be set, in which a partial string of a complete string is used as the feature string. Regular-expression matching is then used to determine whether the current URL contains the feature string set in the matching condition. URLs that contain a specific partial string can thus be filtered, so that the filtering of the current URL in this embodiment can be restricted to URLs that contain a specific class of strings.
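The two feature string matching modes, search matching and regular-expression matching, can be sketched with Python's standard re module. The mode label "substr" is taken from the example above; the label "regex", the function name, and the sample texts and pattern are assumptions for illustration.

```python
import re


def string_matches(text: str, mode: str, pattern: str) -> bool:
    """Match a feature string by substring search or regular expression."""
    if mode == "substr":
        # Search matching: the feature string must occur somewhere in text.
        return pattern in text
    if mode == "regex":
        # Regular-expression matching ("regex" is an assumed mode label).
        return re.search(pattern, text) is not None
    raise ValueError(f"unknown matching mode: {mode}")


# Message body containing the feature string from the example.
body = '<script src="http://news.sina.com/gj/303/data.js"></script>'
print(string_matches(body, "substr", "http://news.sina.com/gj/303/data.js"))

# A hypothetical pattern that filters a whole class of URLs at once.
print(string_matches("http://www.example.com/gj/303/page", "regex", r"/gj/\d+/"))
```

A regular expression covers a family of URLs with one rule, whereas substring search is simpler and cheaper when a single fixed marker identifies the spam page.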
Through the embodiment provided by the present invention, the feature parameter and/or feature string in the current URL is matched, in a predetermined matching manner, against the feature parameter and/or feature string set in the filtering field. The URLs corresponding to spam web pages among the to-be-tested URLs are thereby identified accurately and filtered out, and Web security scanning is then performed on the URLs that remain, which improves the efficiency of Web security scanning.
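The two feature-string matching manners described above (search matching and regular-expression matching) can be illustrated with a minimal Python sketch. The function name `match_feature_string`, the mode labels "substr" and "regex", and the sample message text are assumptions made for illustration only; the specification does not prescribe an implementation.

```python
import re


def match_feature_string(text, mode, pattern):
    """Return True if `text` satisfies the feature-string matching condition.

    mode "substr": search matching (pattern occurs anywhere in text).
    mode "regex":  regular-expression matching (pattern found anywhere in text).
    Both mode labels are hypothetical names chosen for this sketch.
    """
    if mode == "substr":
        return pattern in text
    if mode == "regex":
        return re.search(pattern, text) is not None
    raise ValueError(f"unknown matching mode: {mode}")


# Hypothetical message text ("Content") returned for the current URL.
body = '<script src="http://news.sina.com/gj/303/data.js"></script>'

found_by_search = match_feature_string(
    body, "substr", "http://news.sina.com/gj/303/data.js")
found_by_regex = match_feature_string(body, "regex", r"/gj/\d+/data\.js")
```

Search matching suits a fixed link string such as the one in Fig. 6, whereas the regular expression covers a whole class of strings (here, any numeric directory under /gj/).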
As an optional solution, in step S206, after the filtering operation is performed on each URL in the to-be-processed URL set, the method further comprises:
S1, performing a web page security scanning operation on the to-be-processed web page indicated by each URL remaining in the to-be-processed URL set after the current URLs have been filtered out.
This is described with reference to the following example. After the URLs corresponding to spam web pages are filtered out of the URL set, the remaining URLs are saved; when the web page security scanning operation is performed, the saved URLs, which no longer include spam web pages, are called directly.
Through the embodiment provided by the present invention, the web page security scanning operation is performed only on the to-be-processed web pages indicated by the URLs that remain in the to-be-processed URL set after the URLs corresponding to spam web pages have been filtered out. Scanning of URLs corresponding to spam web pages is thereby avoided, which improves the efficiency of Web security scanning.
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. However, those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
Embodiment 2
According to an embodiment of the present invention, an apparatus for filtering uniform resource locators (URLs) of web pages is provided. The apparatus may be applied in the hardware environment shown in Fig. 1, wherein the apparatus is located in a filtering server 102 that filters the URLs of web pages. The filtering server 102 may establish a link over a network with the web page server 104 on which the web pages reside, and filters the to-be-processed URLs sent by the web page server 104. The network includes but is not limited to: a wide area network, a metropolitan area network, or a local area network.
According to an embodiment of the present invention, an apparatus for filtering URLs of web pages is further provided. As shown in Fig. 7, the apparatus comprises:
1) an acquiring unit 702, configured to obtain a to-be-processed URL set, wherein the to-be-processed URL set comprises the URLs of a plurality of to-be-processed web pages;
2) a filtering unit 704, configured to perform the following filtering operation on each URL in the to-be-processed URL set, wherein the URL in the to-be-processed URL set on which the filtering operation is currently performed is the current URL:
i) determining, according to the filtering identifier in a preset configuration file, whether the current URL is a to-be-tested URL;
ii) when the current URL is a to-be-tested URL, matching the current URL according to the filtering field in the configuration file;
iii) when the current URL is successfully matched according to the filtering field, filtering the current URL out of the to-be-processed URL set.
Optionally, in this embodiment, the above method for filtering URLs of web pages may be applied in the process of Web security scanning. For example, with reference to Fig. 1, before the Web security scanning is performed, the to-be-processed URL set is obtained, wherein the set comprises the URLs of a plurality of to-be-processed web pages, and the filtering operation is performed on each URL in the set, so that the URLs corresponding to spam web pages, which do not need Web security scanning, are filtered out of the massive number of URLs obtained by the filtering server 102. The above is only an example, and this embodiment imposes no limitation in this respect.
Optionally, in this embodiment, as shown in Fig. 3, before the to-be-processed URL set is obtained, the interaction between the filtering server 102 and the web page server 104 is as follows:
S302, the filtering server 102 sends, over the network, a request to the web page server 104 to obtain the to-be-processed URL set;
S304, in response to the request, the web page server 104 returns the to-be-processed URL set to the filtering server 102.
Optionally, in this embodiment, the configuration file is a file formed by a JSON string comprising the filtering identifier and the filtering field. JSON (JavaScript Object Notation) is a lightweight data-interchange format; it is text-based, easy for people to read, and easy for machines to parse and generate. The filtering identifier may include but is not limited to: the scope within which filtering is applied to the to-be-processed URL set. For example, the scope may include but is not limited to: all web pages (global) or some web pages (local), wherein the local web pages may be selected by means of a preset domain name. The filtering field may include but is not limited to: an indication of the matching result used to filter the to-be-tested URL, wherein the filtering field may include but is not limited to a plurality of filtering subfields. For example, the matching result may include but is not limited to: a feature parameter to be matched and its matching manner, and a feature string to be matched and its matching manner.
For example, as shown at 402 in Fig. 4, the filtering identifier is denoted by "host". When the value of "host" is "*", the filtering applies to all web pages; when the value of "host" is "domain name/IP", the filtering applies to the web pages corresponding to that "domain name/IP". When the current URL on which the filtering operation is being performed satisfies the filtering identifier, the current URL is determined to be a to-be-tested URL.
As another example, as shown at 404 in Fig. 4, the filtering field is denoted by "rule", wherein "rule" may include but is not limited to the following subfields: 1) a feature parameter, set as the status code "HttpCode"; 2) a feature string, set as the message text "Content". For example, the configuration file configures the value of the status code "HttpCode" to be "equal to" the number "200" and the string of the message text "Content" to be "http://qzone.qq.com/gy/404/data.js". When the to-be-tested URL successfully matches all subfields in the filtering field, the to-be-tested URL is determined to be successfully matched, and the current URL is filtered out of the to-be-processed URL set.
Optionally, in this embodiment, the configuration file may also include but is not limited to: the type name of the configuration file and attributes of the configuration file, wherein the attributes may include but are not limited to: the time at which the configuration file was added and the person who added it. For example, as shown at 406 in Fig. 4, the type name of the configuration file is "gongyi404"; as shown at 408 in Fig. 4, the configuration file was added on "2013-10-13" by "zhangsan".
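As an illustration of the configuration file described with reference to Fig. 4, the following Python sketch parses one possible JSON layout. Only the keys "host" and "rule", the subfield names "HttpCode" and "Content", and the example values come from the description; the keys "name", "add_time", "adder", "field", "op" and "value" are hypothetical names chosen for this sketch, since the exact layout of Fig. 4 is not reproduced in the text.

```python
import json

# One possible JSON layout for a configuration file (layout is an assumption).
CONFIG_TEXT = """
{
  "name": "gongyi404",
  "host": "*",
  "rule": [
    {"field": "HttpCode", "op": "=", "value": 200},
    {"field": "Content", "op": "substr",
     "value": "http://qzone.qq.com/gy/404/data.js"}
  ],
  "add_time": "2013-10-13",
  "adder": "zhangsan"
}
"""

# json.loads turns the JSON string into nested dicts and lists.
config = json.loads(CONFIG_TEXT)

filtering_identifier = config["host"]   # scope of the filtering
filtering_subfields = config["rule"]    # every subfield must match
```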
Optionally, in this embodiment, after filtering has been performed on the to-be-processed URLs, the URLs that remain after the URLs corresponding to spam web pages have been filtered out are saved, so that they can be called during Web security scanning, which improves the efficiency of Web security scanning.
Optionally, in this embodiment, the configuration file may be saved in the form of a hash table, and the storage location may be at least one of the following: a text file on disk, or a file on a database server. Optionally, the configuration file is loaded from that location only when filtering needs to be performed on the to-be-processed URLs, in order to filter the URLs in the to-be-processed URL set. Optionally, in this embodiment, the configuration file may be loaded by, but is not limited to, traversing the hash table to look up the configuration file corresponding to the current URL.
Optionally, the preset configuration file is a plurality of configuration files, wherein the apparatus for filtering URLs of web pages performs the steps of determining, according to the filtering identifier in the preset configuration file, whether the current URL is a to-be-tested URL, matching the current URL according to the filtering field in the configuration file, and filtering the current URL out of the to-be-processed URL set, as follows: a matching configuration file is searched for among the plurality of configuration files, wherein a matching configuration file is one for which the current URL is determined to be a to-be-tested URL according to its filtering identifier and the current URL is successfully matched according to its filtering field; as soon as one matching configuration file is found among the plurality of configuration files, the current URL is filtered out of the to-be-processed URL set.
The filtering process for the URLs of web pages is described below with reference to Fig. 1 and the flowchart shown in Fig. 5. Suppose the URL set comprises URL_1, URL_2, URL_3, URL_4 and URL_5, and the plurality of configuration files are P_1, P_2 and P_3.
S502, the filtering server 102 requests the to-be-processed URL set from the web page server 104;
S504, the web page server 104 returns the to-be-processed URL set to the filtering server 102;
S506, the filtering server 102 determines whether the filtering operation has been performed on all URLs, wherein the URL set comprises URL_1, URL_2 and URL_3; if there are URLs in the URL set on which the filtering operation has not yet been performed, step S508 is performed; if the filtering operation has been performed on every URL in the URL set, this round of filtering ends;
S508, the filtering server 102 reads one URL (for example, URL_3) on which the filtering operation has not yet been performed;
S510, the filtering server 102 determines whether the matching operation has been performed for all configuration files; if there is a configuration file for which the matching operation has not yet been performed, step S512 is performed; if the matching operation (S514, or S514 and S516) has been performed for all configuration files, the process returns to S506; wherein, the first time the filtering server 102 determines whether all configuration files have been matched, it first loads all configuration files (for example, configuration files P_1, P_2 and P_3);
S512, the filtering server 102 reads one configuration file (for example, configuration file P_2) for which the matching operation has not yet been performed;
S514, the filtering server 102 determines whether the current URL satisfies the condition indicated by the filtering identifier; if so, step S516 is performed; otherwise, the process returns to search for and evaluate a new configuration file;
S516, the filtering server 102 matches the current URL (for example, URL_3) against the subfields in the filtering field in turn;
S518, the filtering server 102 determines whether the match succeeded; if so, the current URL (for example, URL_3) is determined to be a URL corresponding to a spam web page, and step S520 is performed; if not, the process returns to S510 to determine whether the matching operation has been performed for all configuration files (for example, configuration file P_3 has not yet been matched);
S520, the filtering server 102 filters out the URL corresponding to the spam web page and returns to step S506 to determine whether the filtering operation has been performed on all URLs in the URL set.
Through the embodiment provided by the present invention, to-be-tested URLs are selected from the massive to-be-processed URL set by means of the filtering identifier in the preset configuration file, and the filtering field in the configuration file is then used to match the to-be-tested URLs, so as to obtain the URLs that need to be filtered out of the to-be-processed URL set. The URLs in the obtained URL set that correspond to unnecessary spam web pages are thereby filtered out effectively, so that they need not be scanned during Web security scanning, which improves the efficiency of Web security scanning.
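The loop of steps S506 to S520 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the status code and message text of each page are assumed to have been fetched already and are passed in as a dictionary, the rule layout ("field", "op", "value") is hypothetical, and the domain check uses simple string containment as the text describes.

```python
def is_to_be_tested(config, url):
    """S514: check the filtering identifier ("host") against the URL."""
    host = config["host"]
    return host == "*" or host in url


def rule_matches(rule, page):
    """S516: check one filtering subfield against already-fetched page data."""
    actual = page[rule["field"]]
    if rule["op"] == "=":
        return actual == rule["value"]
    if rule["op"] == "substr":
        return rule["value"] in actual
    raise ValueError(f"unknown matching manner: {rule['op']}")


def filter_urls(pages, configs):
    """Return the URLs that remain after filtering (S506 to S520).

    pages:   dict mapping URL -> {"HttpCode": int, "Content": str}
    configs: list of configuration files (dicts with "host" and "rule")
    """
    kept = []
    for url, page in pages.items():              # S506 / S508
        is_spam = False
        for config in configs:                   # S510 / S512
            if not is_to_be_tested(config, url): # S514
                continue
            # S516 / S518: every subfield must match.
            if all(rule_matches(r, page) for r in config["rule"]):
                is_spam = True                   # S520: one match suffices
                break
        if not is_spam:
            kept.append(url)
    return kept


# Hypothetical sample data for illustration.
pages = {
    "http://qzone.qq.com/a": {
        "HttpCode": 200,
        "Content": '<script src="http://qzone.qq.com/gy/404/data.js"></script>',
    },
    "http://qzone.qq.com/b": {"HttpCode": 200, "Content": "<html>ok</html>"},
    "http://news.sina.com/c": {"HttpCode": 404, "Content": ""},
}
p_1 = {
    "host": "*",
    "rule": [
        {"field": "HttpCode", "op": "=", "value": 200},
        {"field": "Content", "op": "substr",
         "value": "http://qzone.qq.com/gy/404/data.js"},
    ],
}
remaining = filter_urls(pages, [p_1])
```

The inner loop stops at the first matching configuration file, which mirrors the statement that finding a single matching configuration file is sufficient to filter out the current URL.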
As an optional solution, the filtering unit 704 comprises:
1) a first determining module, configured to determine that the current URL is a to-be-tested URL when the filtering identifier is a field indicating that all web pages are to be filtered; or
2) a second determining module, configured to, when the filtering identifier is a field indicating that a preset domain name is to be filtered, determine whether the current URL contains the preset domain name, and, if the current URL contains the preset domain name, determine that the current URL is a to-be-tested URL.
This is described with reference to the following example. As shown at 402 in Fig. 4, the filtering identifier is denoted by "host". When the value of "host" is "*", the filtering applies to all web pages, so the URLs corresponding to all web pages are determined to be to-be-tested URLs, which are then used in the subsequent matching determination.
This is described with reference to the following example. In the configuration file shown in Fig. 6, when the value of "host" 602 is "www.sina.com/168.1.1.3", the filtering applies to the corresponding web pages of the Sina website. When the current URL on which the filtering operation is being performed contains the preset domain name, the current URL is determined to be a to-be-tested URL.
Optionally, in this embodiment, if the current URL does not fall within the scope indicated by the filtering identifier, the current URL is not a URL corresponding to a spam web page, and the determination for the filtering operation is not continued.
Through the embodiment provided by the present invention, the to-be-processed URL set is screened by the filtering identifier to obtain the to-be-tested URLs within the corresponding scope, and the URLs corresponding to spam web pages are then filtered out within that scope. The filtering operation is thus performed on URLs within a preset scope, which improves the accuracy of URL filtering.
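The determination made by the first and second determining modules can be sketched as below. Splitting a "domain name/IP" value on "/" and comparing against the parsed host name of the URL are assumptions of this sketch; the description only states that the current URL must contain the preset domain name, so exact host comparison is a stricter variant chosen here to avoid accidental matches elsewhere in the URL.

```python
from urllib.parse import urlparse


def is_to_be_tested(host_value, url):
    """Decide whether `url` falls within the scope given by "host".

    "*" means all web pages; otherwise `host_value` is assumed to have
    the form "domain" or "domain/IP", e.g. "www.sina.com/168.1.1.3".
    """
    if host_value == "*":
        return True
    allowed_hosts = host_value.split("/")
    return urlparse(url).hostname in allowed_hosts


in_scope = is_to_be_tested("www.sina.com/168.1.1.3",
                           "http://www.sina.com/news/1.html")
out_of_scope = is_to_be_tested("www.sina.com/168.1.1.3",
                               "http://qzone.qq.com/index.html")
```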
As an optional solution, the filtering unit 704 comprises:
1) a matching module, configured to perform on the current URL the matching operation indicated in the filtering field;
2) a third determining module, configured to determine whether the current URL is successfully matched according to whether the result of the matching operation satisfies the matching result indicated in the filtering field.
Optionally, in this embodiment, the matching operation indicated in the filtering field includes but is not limited to: feature-parameter matching and feature-string matching. The feature parameter may include but is not limited to at least one of the following: the status code of the web page corresponding to the current URL, and/or the Content-Length field representing the size of the web page corresponding to the current URL. The feature string may include but is not limited to at least one of the following: the link of the current URL, or a partial string of the link of the current URL.
Optionally, in this embodiment, the feature parameter and/or feature string and the corresponding matching manner may be set in the filtering field, so that the matching operation can be performed on the to-be-tested URL. Optionally, in this embodiment, the matching manner for the feature parameter may include but is not limited to: greater than the value of the feature parameter, less than the value of the feature parameter, or equal to the value of the feature parameter. In this embodiment, the matching manner for the feature string may include but is not limited to: search matching and regular-expression matching.
For example, the current URL on which the filtering operation is performed is matched against the matching result indicated in the filtering field; if the result of performing the matching operation on the current URL satisfies the matching result indicated in the filtering field, the current URL is determined to be successfully matched.
Through the embodiment provided by the present invention, the matching operation is performed on the to-be-tested URL by means of the filtering field to further determine whether the to-be-tested URL corresponds to an unnecessary spam web page. URLs corresponding to spam web pages are thereby identified accurately, which improves the efficiency and reduces the cost of Web security scanning.
As an optional solution, the matching module performs on the current URL the matching operation indicated in the filtering field by: determining whether the feature parameter in the current URL and the feature parameter in the filtering field satisfy the matching condition in the configuration file, wherein the feature parameter comprises: the status code of the web page corresponding to the current URL, and/or the Content-Length field representing the size of the web page corresponding to the current URL. The third determining module determines whether the current URL is successfully matched, according to whether the result of the matching operation satisfies the matching result indicated in the filtering field, by: if the feature parameter in the current URL and the feature parameter in the filtering field satisfy the matching condition in the configuration file, determining that the current URL is successfully matched.
This is described with reference to the following example. As shown in Fig. 6, suppose the matching operation indicated in the filtering field is feature-parameter matching; it is then determined whether the feature parameter in the current URL and the feature parameter in the filtering field satisfy a predetermined matching condition. For example, the feature parameter in the filtering field is the status code "HttpCode", and the predetermined matching condition is "=; 200". It is then determined whether the feature parameter of the current URL (namely, the status code "HttpCode") equals 200; if the status code "HttpCode" of the current URL satisfies the matching condition, the current URL is determined to be successfully matched.
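Feature-parameter matching with the three matching manners named above (greater than, less than, equal to) can be sketched as follows. The operator symbols ">" and "<", the lookup table `COMPARATORS` and the function name are assumptions of this sketch; only the "=" manner with the value 200 appears in the example of Fig. 6.

```python
import operator

# Map each matching manner to a comparison function (symbols are assumed).
COMPARATORS = {
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
}


def match_feature_parameter(actual, manner, expected):
    """Compare a feature parameter of the page (e.g. its status code or
    content length) with the value configured in the filtering field."""
    if manner not in COMPARATORS:
        raise ValueError(f"unknown matching manner: {manner}")
    return COMPARATORS[manner](actual, expected)


# Status code "HttpCode" of the page for the current URL must equal 200.
status_matches = match_feature_parameter(200, "=", 200)
# A hypothetical size rule: content length below 512 bytes.
size_matches = match_feature_parameter(120, "<", 512)
```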
As an optional solution, the matching module performs on the current URL the matching operation indicated in the filtering field by: determining whether the feature string in the current URL and the feature string in the filtering field satisfy the matching condition in the configuration file, wherein the feature string comprises at least one of the following: the link of the current URL, or a partial string of the link of the current URL. The third determining module determines whether the current URL is successfully matched, according to whether the result of the matching operation satisfies the matching result indicated in the filtering field, by: if the feature string in the current URL and the feature string in the filtering field satisfy the matching condition in the configuration file, determining that the current URL is successfully matched.
This is described with reference to the following example. As shown in Fig. 6, suppose the matching operation indicated in the filtering field is feature-string matching; it is then determined whether the feature string in the current URL and the feature string in the filtering field satisfy a predetermined matching condition. For example, the feature string in the filtering field is the message text "Content", and the predetermined matching condition is "substr=; stc="http://news.sina.com/gj/303/data.js"". It is then determined whether the feature string of the current URL (namely, the message text "Content") satisfies this matching condition, for example the link string indicating international news in Sina News shown in Fig. 6. If the feature string is found in the current URL, the current URL is determined to be successfully matched.
As another example, a regular-expression matching condition is set, in which a partial string of a complete string serves as the feature string. Regular-expression matching is used to determine whether the current URL contains the feature string set in the regular-expression matching condition, so that URLs containing a specific partial string are filtered out. In this embodiment, filtering of the current URL can thus be restricted to URLs containing a specific class of string.
Through the embodiment provided by the present invention, the feature parameter and/or feature string in the current URL is matched, in a predetermined matching manner, against the feature parameter and/or feature string set in the filtering field. The URLs corresponding to spam web pages among the to-be-tested URLs are thereby identified accurately and filtered out, and Web security scanning is then performed on the URLs that remain, which improves the efficiency of Web security scanning.
As an optional solution, the preset configuration file is a plurality of configuration files, wherein the filtering unit 704 comprises:
1) a searching module, configured to search for a matching configuration file among the plurality of configuration files, wherein a matching configuration file is one for which the current URL is determined to be a to-be-tested URL according to its filtering identifier and is successfully matched according to its filtering field;
2) a filtering module, configured to filter the current URL out of the to-be-processed URL set as soon as one matching configuration file is found among the plurality of configuration files.
This is described with reference to Fig. 5. In step S518 of the above process, if the current URL (for example, URL_3) does not match the configuration file P_2 currently being matched, the process returns to search for a new configuration file: the searching module looks up and reads, from the plurality of configuration files, a configuration file on which the matching operation has not yet been performed, for example configuration file P_3. If the current URL (for example, URL_3) successfully matches the configuration file P_2 currently being matched, that is, configuration file P_2 is the matching configuration file found by the searching module, the filtering module performs step S520 and filters out the current URL (for example, URL_3), which has been determined to correspond to a spam web page.
As an optional solution, the apparatus further comprises:
1) a scanning unit, configured to, after the filtering operation has been performed on each URL in the to-be-processed URL set, perform a web page security scanning operation on the to-be-processed web page indicated by each URL remaining in the to-be-processed URL set after the current URLs have been filtered out.
This is described with reference to the following example. After the URLs corresponding to spam web pages are filtered out of the URL set, the remaining URLs are saved; when the web page security scanning operation is performed, the saved URLs, which no longer include spam web pages, are called directly.
Through the embodiment provided by the present invention, the web page security scanning operation is performed only on the to-be-processed web pages indicated by the URLs that remain in the to-be-processed URL set after the URLs corresponding to spam web pages have been filtered out. Scanning of URLs corresponding to spam web pages is thereby avoided, which improves the efficiency of Web security scanning.
The sequence numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
Embodiment 3
According to an embodiment of the present invention, a server for implementing the above filtering of URLs of web pages is further provided. As shown in Fig. 8, the server comprises:
1) a memory 802, configured to store the configuration file used to filter the URLs of web pages and the URLs that remain after filtering is complete.
Optionally, in this embodiment, the content stored in the memory 802 may also be obtained from servers other than the filtering server 102; this embodiment imposes no limitation in this respect.
Optionally, in this embodiment, the memory 802 may also be used to store other data stored during the filtering process of Embodiment 1.
2) a processor 804, configured to perform the following operations for the modules in the above apparatus for filtering URLs of web pages:
S1, obtaining a to-be-processed URL set, wherein the to-be-processed URL set comprises the URLs of a plurality of to-be-processed web pages;
S2, performing the following filtering operation on each URL in the to-be-processed URL set, wherein the URL in the to-be-processed URL set on which the filtering operation is currently performed is the current URL:
S20, determining, according to the filtering identifier in a preset configuration file, whether the current URL is a to-be-tested URL;
S22, if the current URL is a to-be-tested URL, matching the current URL according to the filtering field in the configuration file;
S24, if the current URL is successfully matched according to the filtering field, filtering the current URL out of the to-be-processed URL set.
Optionally, in this embodiment, the processor 804 is further configured to perform the following operations to determine, according to the filtering identifier in the preset configuration file, whether the current URL is a to-be-tested URL:
S1, if the filtering identifier is a field indicating that all web pages are to be filtered, determining that the current URL is a to-be-tested URL; or
S2, if the filtering identifier is a field indicating that a preset domain name is to be filtered, determining whether the current URL contains the preset domain name, and, if the current URL contains the preset domain name, determining that the current URL is a to-be-tested URL.
Optionally, in this embodiment, the processor 804 is further configured to perform the following operations to match the current URL according to the filtering field in the configuration file:
S1, performing on the current URL the matching operation indicated in the filtering field;
S2, determining whether the current URL is successfully matched according to whether the result of the matching operation satisfies the matching result indicated in the filtering field.
3) a communication interface 806, configured to exchange data with the web page server 104.
Optionally, in this embodiment, the communication interface 806 is further configured to exchange data with servers, other than the web page server 104, that are involved in filtering the URLs of web pages.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in Embodiment 1 and Embodiment 2, which are not repeated here.
Embodiment 4
According to an embodiment of the present invention, a storage medium is provided, which may be applied in the hardware environment shown in Fig. 1. Optionally, the storage medium may be, but is not limited to being, located in the filtering server 102 that filters the URLs of web pages.
Optionally, in this embodiment, the storage medium may be applied to filtering the URLs of web pages.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps:
S1, obtaining a to-be-processed URL set, wherein the to-be-processed URL set comprises the URLs of a plurality of to-be-processed web pages;
S2, performing the following filtering operation on each URL in the to-be-processed URL set, wherein the URL in the to-be-processed URL set on which the filtering operation is currently performed is the current URL:
S20, determining, according to the filtering identifier in a preset configuration file, whether the current URL is a to-be-tested URL;
S22, if the current URL is a to-be-tested URL, matching the current URL according to the filtering field in the configuration file;
S24, if the current URL is successfully matched according to the filtering field, filtering the current URL out of the to-be-processed URL set.
Optionally, the storage medium is further configured to store program code for performing the following steps to determine, according to the filtering identifier in the preset configuration file, whether the current URL is a to-be-tested URL:
S1, if the filtering identifier is a field indicating that all web pages are to be filtered, determining that the current URL is a to-be-tested URL; or
S2, if the filtering identifier is a field indicating that a preset domain name is to be filtered, determining whether the current URL contains the preset domain name, and, if the current URL contains the preset domain name, determining that the current URL is a to-be-tested URL.
Optionally, the storage medium is further configured to store program code for performing the following steps to match the current URL according to the filtering field in the configuration file:
S1, performing on the current URL the matching operation indicated in the filtering field;
S2, determining whether the current URL is successfully matched according to whether the result of the matching operation satisfies the matching result indicated in the filtering field.
Optionally, in this embodiment, the storage medium may include, but is not limited to, any medium that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
Optionally, for specific examples in this embodiment, refer to the examples described in Embodiment 1 and Embodiment 2 above; they are not repeated here.
The sequence numbers of the foregoing embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
If the integrated unit in the foregoing embodiments is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in the foregoing computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention.
In the foregoing embodiments of the present invention, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed client may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between units or modules may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also fall within the protection scope of the present invention.

Claims (16)

1. A method for filtering uniform resource locators (URLs) of web pages, characterized by comprising:
obtaining a to-be-processed URL set, wherein the to-be-processed URL set comprises the URLs of a plurality of to-be-processed web pages;
performing the following filtering operation on each URL in the to-be-processed URL set, wherein the URL in the to-be-processed URL set on which the filtering operation is currently being performed is the current URL:
determining, according to a filtering identifier in a preset configuration file, whether the current URL is a to-be-tested URL;
if the current URL is the to-be-tested URL, matching the current URL according to a filtering field in the configuration file;
if the current URL is successfully matched according to the filtering field, filtering the current URL out of the to-be-processed URL set.
2. The method according to claim 1, characterized in that determining, according to the filtering identifier in the preset configuration file, whether the current URL is a to-be-tested URL comprises:
if the filtering identifier is a field indicating that all web pages are to be filtered, determining that the current URL is the to-be-tested URL; or
if the filtering identifier is a field indicating that a preset domain name is to be filtered, determining whether the current URL contains the preset domain name, and if the current URL contains the preset domain name, determining that the current URL is the to-be-tested URL.
3. The method according to claim 1, characterized in that matching the current URL according to the filtering field in the configuration file comprises:
performing the matching operation indicated in the filtering field on the current URL;
determining whether the current URL is successfully matched according to whether the result obtained by performing the matching operation conforms to the matching result indicated in the filtering field.
4. The method according to claim 3, characterized in that:
performing the matching operation indicated in the filtering field on the current URL comprises: determining whether a matching condition in the configuration file is satisfied between a characteristic parameter of the current URL and a characteristic parameter in the filtering field, wherein the characteristic parameter comprises the status code of the web page corresponding to the current URL, and/or a Content-Length field representing the size of the web page corresponding to the current URL;
determining whether the current URL is successfully matched according to whether the result obtained by performing the matching operation conforms to the matching result indicated in the filtering field comprises: if the matching condition in the configuration file is satisfied between the characteristic parameter of the current URL and the characteristic parameter in the filtering field, determining that the current URL is successfully matched.
5. The method according to claim 3, characterized in that:
performing the matching operation indicated in the filtering field on the current URL comprises: determining whether a matching condition in the configuration file is satisfied between a feature string of the current URL and a feature string in the filtering field, wherein the feature string comprises at least one of the following: the link of the current URL, and a partial character string in the link of the current URL;
determining whether the current URL is successfully matched according to whether the result obtained by performing the matching operation conforms to the matching result indicated in the filtering field comprises: if the matching condition in the configuration file is satisfied between the feature string of the current URL and the feature string in the filtering field, determining that the current URL is successfully matched.
6. The method according to claim 1, characterized in that the configuration file is a file formed by a JSON character string comprising the filtering identifier and the filtering field.
7. The method according to any one of claims 1 to 6, characterized in that the preset configuration file is a plurality of configuration files, wherein determining, according to the filtering identifier in the preset configuration file, whether the current URL is a to-be-tested URL, matching the current URL according to the filtering field in the configuration file, and filtering the current URL out of the to-be-processed URL set are performed by the following steps:
searching the plurality of configuration files for a matching configuration file, wherein the current URL is determined to be the to-be-tested URL according to the filtering identifier in the matching configuration file, and the current URL is successfully matched according to the filtering field in the matching configuration file;
as long as one matching configuration file is found among the plurality of configuration files, filtering the current URL out of the to-be-processed URL set.
8. The method according to any one of claims 1 to 6, characterized in that, after the filtering operation is performed on each URL in the to-be-processed URL set, the method further comprises:
performing a web security scanning operation on the to-be-processed web page indicated by each URL in the to-be-processed URL set from which the current URL has been filtered out.
9. An apparatus for filtering uniform resource locators (URLs) of web pages, characterized by comprising:
an obtaining unit, configured to obtain a to-be-processed URL set, wherein the to-be-processed URL set comprises the URLs of a plurality of to-be-processed web pages;
a filtering unit, configured to perform the following filtering operation on each URL in the to-be-processed URL set, wherein the URL in the to-be-processed URL set on which the filtering operation is currently being performed is the current URL:
determining, according to a filtering identifier in a preset configuration file, whether the current URL is a to-be-tested URL;
when the current URL is the to-be-tested URL, matching the current URL according to a filtering field in the configuration file;
when the current URL is successfully matched according to the filtering field, filtering the current URL out of the to-be-processed URL set.
10. The apparatus according to claim 9, characterized in that the filtering unit comprises:
a first determining module, configured to determine that the current URL is the to-be-tested URL when the filtering identifier is a field indicating that all web pages are to be filtered; or
a second determining module, configured to, when the filtering identifier is a field indicating that a preset domain name is to be filtered, determine whether the current URL contains the preset domain name, and if the current URL contains the preset domain name, determine that the current URL is the to-be-tested URL.
11. The apparatus according to claim 9, characterized in that the filtering unit comprises:
a matching module, configured to perform the matching operation indicated in the filtering field on the current URL;
a third determining module, configured to determine whether the current URL is successfully matched according to whether the result obtained by performing the matching operation conforms to the matching result indicated in the filtering field.
12. The apparatus according to claim 11, characterized in that:
the matching module performs the matching operation indicated in the filtering field on the current URL by performing the following step: determining whether a matching condition in the configuration file is satisfied between a characteristic parameter of the current URL and a characteristic parameter in the filtering field, wherein the characteristic parameter comprises the status code of the web page corresponding to the current URL, and/or a Content-Length field representing the size of the web page corresponding to the current URL;
the third determining module determines whether the current URL is successfully matched, according to whether the result obtained by performing the matching operation conforms to the matching result indicated in the filtering field, by performing the following step: if the matching condition in the configuration file is satisfied between the characteristic parameter of the current URL and the characteristic parameter in the filtering field, determining that the current URL is successfully matched.
13. The apparatus according to claim 11, characterized in that:
the matching module performs the matching operation indicated in the filtering field on the current URL by performing the following step: determining whether a matching condition in the configuration file is satisfied between a feature string of the current URL and a feature string in the filtering field, wherein the feature string comprises at least one of the following: the link of the current URL, and a partial character string in the link of the current URL;
the third determining module determines whether the current URL is successfully matched, according to whether the result obtained by performing the matching operation conforms to the matching result indicated in the filtering field, by performing the following step: if the matching condition in the configuration file is satisfied between the feature string of the current URL and the feature string in the filtering field, determining that the current URL is successfully matched.
14. The apparatus according to claim 9, characterized in that the configuration file is a file formed by a JSON character string comprising the filtering identifier and the filtering field.
15. The apparatus according to any one of claims 9 to 14, characterized in that the preset configuration file is a plurality of configuration files, wherein the filtering unit comprises:
a searching module, configured to search the plurality of configuration files for a matching configuration file, wherein the current URL is determined to be the to-be-tested URL according to the filtering identifier in the matching configuration file, and the current URL is successfully matched according to the filtering field in the matching configuration file;
a filtering module, configured to filter the current URL out of the to-be-processed URL set as long as one matching configuration file is found among the plurality of configuration files.
16. The apparatus according to any one of claims 9 to 14, characterized by further comprising:
a scanning unit, configured to, after the filtering operation is performed on each URL in the to-be-processed URL set, perform a web security scanning operation on the to-be-processed web page indicated by each URL in the to-be-processed URL set from which the current URL has been filtered out.
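Claims 4 to 7 (and their apparatus counterparts, claims 12 to 15) combine JSON configuration files, characteristic parameters such as the status code and Content-Length, feature strings, and a "any one matching configuration file suffices" rule. The Python sketch below shows one possible arrangement. The JSON keys (`domain`, `match`, `status_code`, `content_length`, `url_contains`), the `"*"` marker, and the rule that every condition listed in a file must hold are all assumptions made for illustration; the claims do not fix a schema.

```python
import json
from typing import Any, Dict, Iterable

# Two hypothetical configuration files, each a JSON string holding a
# filtering identifier ("domain") and a filtering field ("match").
CONFIG_FILES = [
    '{"domain": "*", "match": {"status_code": 404}}',
    '{"domain": "example.com",'
    ' "match": {"content_length": 0, "url_contains": "/tmp/"}}',
]


def matches_config(
    url: str, status_code: int, content_length: int, config: Dict[str, Any]
) -> bool:
    """Check one configuration file against the current URL and its response."""
    # Filtering identifier: "*" covers all pages, otherwise the URL must
    # contain the preset domain name (plain substring test, as an assumption).
    domain = config["domain"]
    if domain != "*" and domain not in url:
        return False
    # Filtering field: every condition present in the file must hold.
    match = config["match"]
    if "status_code" in match and match["status_code"] != status_code:
        return False
    if "content_length" in match and match["content_length"] != content_length:
        return False
    if "url_contains" in match and match["url_contains"] not in url:
        return False
    return True


def should_filter_out(
    url: str, status_code: int, content_length: int, config_files: Iterable[str]
) -> bool:
    """Filter the URL out as soon as any one configuration file matches."""
    configs = (json.loads(text) for text in config_files)
    return any(
        matches_config(url, status_code, content_length, config)
        for config in configs
    )
```

For example, `should_filter_out("http://a.test/missing", 404, 512, CONFIG_FILES)` is true because the first file matches any page with status code 404, while a normal page on example.com with status 200 and a non-zero Content-Length matches neither file and would go on to web security scanning.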
CN201410284750.6A 2014-06-23 2014-06-23 Method and apparatus for filtering uniform resource locators (URLs) of web pages Active CN105302815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410284750.6A CN105302815B (en) 2014-06-23 2014-06-23 Method and apparatus for filtering uniform resource locators (URLs) of web pages

Publications (2)

Publication Number Publication Date
CN105302815A true CN105302815A (en) 2016-02-03
CN105302815B CN105302815B (en) 2019-06-07

Family

ID=55200091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410284750.6A Active CN105302815B (en) 2014-06-23 2014-06-23 Method and apparatus for filtering uniform resource locators (URLs) of web pages

Country Status (1)

Country Link
CN (1) CN105302815B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1798147A (en) * 2004-12-28 2006-07-05 华为技术有限公司 Method for matching uniform resource locator
CN102110132A (en) * 2010-12-08 2011-06-29 北京星网锐捷网络技术有限公司 Uniform resource locator matching and searching method, device and network equipment
CN102780681A (en) * 2011-05-11 2012-11-14 中兴通讯股份有限公司 URL (Uniform Resource Locator) filtering system and URL filtering method
CN103793462A (en) * 2013-12-02 2014-05-14 北京奇虎科技有限公司 URL (uniform resource locator) purifying method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
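Liu Tao (刘涛): "Research and Implementation of a Firewall Based on Integrated Access Devices" (基于综合接入设备的防火墙研究与实现), China Master's Theses Full-text Database (《中国优秀硕士学位论文全文数据库》) *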

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106227741A (en) * 2016-07-12 2016-12-14 国家计算机网络与信息安全管理中心 A kind of extensive URL matching process based on multilevel hash index chained list
CN106227741B (en) * 2016-07-12 2019-08-30 国家计算机网络与信息安全管理中心 A kind of extensive URL matching process based on multilevel hash index chained list
CN106168977A (en) * 2016-07-15 2016-11-30 河南山谷网安科技股份有限公司 A kind of column recognition methods for web portal security monitoring
CN106168977B (en) * 2016-07-15 2019-07-02 山谷网安科技股份有限公司 A kind of column recognition methods for web portal security monitoring
CN107066510A (en) * 2017-01-22 2017-08-18 南方科技大学 A kind of information processing method and device
CN107066510B (en) * 2017-01-22 2021-12-03 南方科技大学 Information processing method and device
CN108595586A (en) * 2018-04-19 2018-09-28 杭州迪普科技股份有限公司 A kind of determination method and device of search key
CN109639686A (en) * 2018-12-17 2019-04-16 江苏满运软件科技有限公司 Distributed Webpage filtering method, device, electronic equipment, storage medium
CN109639686B (en) * 2018-12-17 2022-02-25 江苏满运软件科技有限公司 Distributed webpage filtering method and device, electronic equipment and storage medium
CN113411332A (en) * 2021-06-18 2021-09-17 杭州安恒信息技术股份有限公司 CORS vulnerability detection method, device, equipment and medium

Also Published As

Publication number Publication date
CN105302815B (en) 2019-06-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant