CN103530336A - Equipment and method for identifying invalid parameters in URLs - Google Patents

Equipment and method for identifying invalid parameters in URLs

Info

Publication number
CN103530336A
Authority
CN
China
Prior art keywords
url
combination
fragment
invalid
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310462262.5A
Other languages
Chinese (zh)
Other versions
CN103530336B (en)
Inventor
魏少俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201310462262.5A priority Critical patent/CN103530336B/en
Publication of CN103530336A publication Critical patent/CN103530336A/en
Priority to PCT/CN2014/083216 priority patent/WO2015043308A1/en
Application granted granted Critical
Publication of CN103530336B publication Critical patent/CN103530336B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention relates to the technical field of search engines and discloses equipment and method for identifying invalid parameters in URLs. The equipment includes a URL obtaining unit, a URL fragment combination extracting unit, a statistical unit and a validity judging unit, wherein the URL obtaining unit is suitable for obtaining the URLs of a plurality of webpage links; the URL fragment combination extracting unit is suitable for extracting URL fragment combinations out of the URLs of the obtained webpage links respectively; the statistical unit is used for the statistics of occurrence frequency of all the URL fragment combinations and determines the URL fragment combinations which meet the preset conditions of the occurrence frequency to be target URL fragment combinations; the validity judging unit is suitable for performing judgment on the validity of all URL parameters in the target URL fragment combinations for all the target URL fragment combinations according to the URLs containing the target URL fragment combinations. According to the equipment and method for identifying the invalid parameters in the URLs, the efficiency for identifying duplicate links is enhanced, and then the efficiency for grabbing information of the search engines is improved.

Description

Equipment and method for identifying invalid parameters in Uniform Resource Locator (URL)
Technical Field
The invention relates to the technical field of search engines, in particular to equipment and a method for identifying invalid parameters in Uniform Resource Locators (URLs).
Background
With the rapid development of computer network technology and the rapid popularization of computer equipment, more and more people use computers and the internet to acquire information, and the internet brings people increasingly rich and diverse services. The data on the internet is growing explosively; taking Chinese web pages as an example, the number of links to Chinese web pages on the internet has reached the trillion scale.
The search engine is a technology that emerged alongside the explosive growth of internet information and aims to meet people's need to find the information they require within the mass of information on the internet. On one hand, a search engine uses a certain strategy and a specific search program to collect various information on the internet, and further processes and organizes that information; on the other hand, the search engine presents the processed and organized information to the user in a certain order to meet the user's retrieval needs. When a search engine collects internet information, an important basis is the uniform resource locator (URL), which can also be understood as the web address corresponding to a web page. Because each web page on the internet corresponds to a unique URL, the search engine can obtain the information in a web page according to its URL. However, among the huge number of URLs on the internet, there are pages whose URLs differ but whose content is the same. In particular, with the increasing use of dynamic web page technology, cases in which the URLs differ but the main content of the web page is the same are increasing rapidly. This poses a problem for search engine technology: how to identify duplicate links among a large number of URLs, so as to reduce the recording of duplicate information and improve the efficiency of information collection.
Disclosure of Invention
In view of the above problems, the present invention has been made to provide an identification apparatus of invalid parameters in a uniform resource locator URL, and a corresponding identification method of invalid parameters in a uniform resource locator URL, which overcome or at least partially solve the above problems.
According to an aspect of the present invention, there is provided an apparatus for identifying an invalid parameter in a uniform resource locator URL, including:
the URL acquisition unit is suitable for acquiring URLs of a plurality of webpage links;
the URL fragment combination extraction unit is suitable for extracting the URL fragment combination from the obtained URLs of the webpage links respectively;
the statistical unit is suitable for counting the occurrence frequency of each URL segment combination and determining the URL segment combination with the occurrence frequency meeting the preset conditions as a target URL segment combination;
and the validity judging unit is suitable for judging the validity of each URL parameter in the target URL fragment group aiming at each target URL fragment combination based on the URL containing the target URL fragment combination.
Optionally, the apparatus further comprises:
the storage unit is used for storing the result of judging the validity of each URL parameter in the target URL fragment group by the validity judging unit as an invalid fragment combination list;
the URL extraction unit to be detected is suitable for acquiring a URL address to be detected corresponding to the webpage link to be detected;
the URL fragment combination extraction unit is suitable for extracting a URL fragment combination from the URL address to be detected;
and the URL parameter detection unit is suitable for judging the validity of the URL parameter in the URL fragment combination according to the invalid fragment combination list.
Optionally, the URL segment combination extracting unit is adapted to:
and extracting the file name of the dynamic file and the corresponding URL parameter from the URL address to be detected, and combining the extracted file name of the dynamic file and the corresponding URL parameter to be used as the URL fragment combination.
Optionally, the invalid segment combination list stores invalid segment combinations and validity information of each URL parameter in the combinations.
Optionally, the URL parameter detecting unit is adapted to: inquiring the invalid fragment combination list by the URL fragment combination, and inquiring whether a matched invalid fragment combination exists in the invalid fragment combination list;
if the URL exists, judging the validity of the URL parameters in the URL fragment combination according to the matched invalid fragment combination and the validity information of each URL parameter in the URL fragment combination.
Optionally, the statistical unit includes:
the first statistical subunit is suitable for counting the number of URLs containing the same URL segment combination, determining the number as the occurrence frequency of the URL segment combination, and determining the URL segment combination with the occurrence frequency meeting the preset conditions as a target URL segment combination;
or,
the second statistical subunit is suitable for counting the number of different internet positions corresponding to the same URL segment combination, determining the number as the occurrence frequency of the URL segment combination, and determining the URL segment combination with the occurrence frequency meeting the preset conditions as a target URL segment combination; wherein the internet location is determined by a network path in the URL.
Optionally, the statistical unit includes:
a third counting subunit, adapted to count the number of URLs containing the same URL segment combination, and determine the number as the first occurrence frequency of the URL segment combination;
the fourth counting subunit is suitable for counting the number of different internet positions corresponding to the same URL segment combination and determining the number as the second occurrence frequency of the URL segment combination; wherein the internet location is determined by a network path in the URL;
and the determining subunit is suitable for determining the URL segment combination with the first occurrence frequency and the second occurrence frequency meeting the preset condition as the target URL segment combination.
Optionally, the determining subunit includes:
the joint frequency calculating subunit is suitable for calculating the joint frequency of the URL segment combination according to the first occurrence frequency, the second occurrence frequency and respective preset weights;
and the joint determining subunit is suitable for determining the URL fragment combination with the joint frequency meeting the preset condition as the target URL fragment combination.
Optionally, the validity judging unit includes:
the sampling unit is suitable for extracting a preset number of URLs from URLs containing the target URL fragment combination;
and the validity judgment subunit is suitable for judging the validity of each parameter in the target URL segment combination based on the preset number of URLs extracted by the sampling unit.
Optionally, the validity judging unit is specifically adapted to:
for each target URL fragment combination, comparing the webpage content returned for a URL containing the target URL fragment combination before and after each parameter of the URL is removed, and if the webpage content is consistent before and after a certain parameter is removed, determining that the parameter is an invalid parameter in the target URL fragment combination.
Optionally, the URL segment combination extraction unit is specifically adapted to:
if a URL contains the file name of a dynamic file and at least two corresponding parameters, extracting the file name of the dynamic file and the corresponding parameters as the URL fragment combination of that URL.
Optionally, the validity judging unit is adapted to:
and aiming at each target URL fragment combination, extracting URLs distributed at different internet positions in preset number from the URLs comprising the target URL fragment combination, and judging the effectiveness of each parameter in the target URL fragment group based on the extracted URLs.
According to another aspect of the present invention, there is provided a method for identifying invalid parameters in a URL, including:
acquiring URLs of a plurality of webpage links;
respectively extracting the URL fragment combinations from the obtained URLs of the plurality of webpage links;
counting the occurrence frequency of each URL segment combination, and determining the URL segment combination with the occurrence frequency meeting the preset conditions as a target URL segment combination;
and aiming at each target URL fragment combination, judging the validity of each URL parameter in the target URL fragment group based on the URL containing the target URL fragment combination.
Optionally, the method further comprises:
storing the result of judging the validity of each URL parameter in the target URL fragment group by the validity judging unit as an invalid fragment combination list;
acquiring a URL address to be detected corresponding to a webpage link to be detected;
extracting a URL fragment combination from the URL address to be detected;
and judging the validity of the URL parameter in the URL fragment combination according to the invalid fragment combination list.
Optionally, the extracting a URL segment combination from the URL address to be detected includes:
extracting a file name of a dynamic file and a corresponding URL parameter from the URL address to be detected, and combining the extracted file name of the dynamic file and the corresponding URL parameter to be used as the URL fragment combination;
and the invalid fragment combination list stores invalid fragment combinations and validity information of each parameter in the combinations.
Optionally, the invalid segment combination list stores invalid segment combinations and validity information of each URL parameter in the combinations.
Optionally, the determining, according to the invalid segment combination list, the validity of the URL parameter in the URL segment combination includes:
inquiring the invalid fragment combination list by the URL fragment combination, and inquiring whether a matched invalid fragment combination exists in the invalid fragment combination list;
if the URL exists, judging the validity of the URL parameters in the URL fragment combination according to the matched invalid fragment combination and the validity information of each URL parameter in the URL fragment combination.
Optionally, the counting the occurrence frequency of each URL segment combination, and determining the URL segment combination whose occurrence frequency meets a preset condition as a target URL segment combination includes:
counting the number of URLs including the same URL segment combination, determining the number as the occurrence frequency of the URL segment combination, and determining the URL segment combination with the occurrence frequency meeting the preset conditions as a target URL segment combination;
or,
counting the number of different internet positions corresponding to the same URL segment combination, determining the number as the occurrence frequency of the URL segment combination, and determining the URL segment combination with the occurrence frequency meeting the preset conditions as a target URL segment combination; wherein the internet location is determined by a network path in the URL.
Optionally, the counting the occurrence frequency of each URL segment combination, and determining the URL segment combination whose occurrence frequency meets a preset condition as a target URL segment combination includes:
counting the number of URLs including the same URL segment combination, and determining the number as the first occurrence frequency of the URL segment combination;
counting the number of different internet positions corresponding to the same URL segment combination, and determining the number as the second occurrence frequency of the URL segment combination; wherein the internet location is determined by a network path in the URL;
and determining the URL segment combination with the first occurrence frequency and the second occurrence frequency meeting the preset conditions as a target URL segment combination.
Optionally, the determining, as a target URL segment combination, a URL segment combination whose first frequency of occurrence and second frequency of occurrence meet a preset condition includes:
calculating the joint frequency of the URL segment combination according to the first occurrence frequency, the second occurrence frequency and respective preset weights;
and determining the URL fragment combination with the joint frequency meeting the preset condition as a target URL fragment combination.
Optionally, the determining, for each target URL segment combination, validity of each parameter in the target URL segment group based on the URL including the target URL segment combination includes:
extracting a preset number of URLs from URLs containing the target URL fragment combination;
and judging the validity of each parameter in the target URL fragment combination based on the extracted preset number of URLs.
Optionally, the determining, for each target URL segment combination, validity of each parameter in the target URL segment group based on the URL including the target URL segment combination includes:
for each target URL fragment combination, comparing the webpage content returned for a URL containing the target URL fragment combination before and after each parameter of the URL is removed, and if the webpage content is consistent before and after a certain parameter is removed, determining that the parameter is an invalid parameter in the target URL fragment combination.
Optionally, the extracting the URL segment combination from each URL respectively includes:
if a URL contains the file name of a dynamic file and at least two corresponding parameters, extracting the file name of the dynamic file and the corresponding parameters as the URL fragment combination of that URL.
Optionally, the determining, for each target URL segment combination, validity of each parameter in the target URL segment group based on the URL including the target URL segment combination includes:
and aiming at each target URL fragment combination, extracting URLs distributed at different internet positions in preset number from the URLs comprising the target URL fragment combination, and judging the effectiveness of each parameter in the target URL fragment group based on the extracted URLs.
According to the identification device for invalid parameters in a uniform resource locator (URL), the URLs of a plurality of webpage links can be obtained; URL fragment combinations are extracted from the obtained URLs of the plurality of webpage links; the occurrence frequency of each URL segment combination is counted, and the URL segment combinations whose occurrence frequency meets the preset conditions are determined as target URL segment combinations; and, for each target URL fragment combination, the validity of each URL parameter in the target URL fragment combination is judged based on the URLs containing that combination. Based on the acquired URLs of the plurality of webpage links, the URL segment combinations that meet the conditions are selected through statistics and filtering, and each selected combination is judged once. This solves the problem of low identification efficiency in the traditional approach, in which a search engine identifying invalid parameters in duplicate links must examine every collected link, exhaust all the possibilities of each parameter, and judge them one by one. The method and the device thereby achieve the purposes of quickly identifying invalid parameters in links and improving the efficiency of identifying duplicate links.
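One of the optional validity judgments described above compares the webpage content before and after each parameter is removed. A minimal Python sketch of that comparison follows, under stated assumptions: `fetch` is a hypothetical caller-supplied function that returns the page content for a URL, "consistent" is approximated by exact equality, and a parameter is marked invalid only if removing it leaves the content unchanged for every sampled URL. The function names and the 0/1 return convention are illustrative choices, not taken from the patent text.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def without_parameter(url, name):
    """Return the URL with every occurrence of the query parameter `name` removed."""
    parts = urlsplit(url)
    kept = [(n, v) for n, v in parse_qsl(parts.query, keep_blank_values=True) if n != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def judge_parameters(sample_urls, fetch):
    """Return {parameter name: 1 if judged invalid, 0 if judged valid}.

    `fetch` is a hypothetical callable (URL -> page content) supplied by the caller.
    """
    verdict = {}
    for url in sample_urls:
        original = fetch(url)
        for name, _ in parse_qsl(urlsplit(url).query, keep_blank_values=True):
            same = fetch(without_parameter(url, name)) == original
            # Assumption: one sampled URL whose content changes is enough to mark
            # the parameter as valid, and that verdict is never reversed.
            verdict[name] = 1 if same and verdict.get(name, 1) == 1 else 0
    return verdict
```

In practice the equality test would likely be replaced by a comparison of the main content of the two pages, since dynamic pages often differ in incidental details such as timestamps.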
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 illustrates a flow diagram of a method for identifying invalid parameters in a Uniform Resource Locator (URL) address in accordance with one embodiment of the present invention;
FIG. 2 is a diagram illustrating an apparatus for identifying invalid parameters in a Uniform Resource Locator (URL) address according to one embodiment of the invention;
FIG. 3 is a diagram illustrating another apparatus for identifying invalid parameters in a Uniform Resource Locator (URL) address according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a further apparatus for identifying invalid parameters in a Uniform Resource Locator (URL) address in accordance with one embodiment of the present invention;
FIG. 5 is a diagram illustrating an apparatus for identifying invalid parameters in a Uniform Resource Locator (URL) address according to an embodiment of the invention;
FIG. 6 is a schematic diagram of yet another apparatus for identifying invalid parameters in a Uniform Resource Locator (URL) address in accordance with one embodiment of the present invention;
FIG. 7 is a schematic diagram of yet another apparatus for identifying invalid parameters in a Uniform Resource Locator (URL) address in accordance with one embodiment of the present invention; and
FIG. 8 is a schematic diagram illustrating an example application of the method for identifying invalid parameters in a Uniform Resource Locator (URL) address according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Referring to fig. 1, a flowchart of a method for identifying an invalid parameter in a URL address according to an embodiment of the present invention is shown, where the method includes the following steps:
S110: acquiring URLs of a plurality of webpage links;
When detecting invalid parameters in URL addresses, the URL addresses to be detected are obtained first. They may be crawled by a search engine server, or a user's browser may extract the URL addresses of browsed web pages to serve as the URL addresses to be detected. Alternatively, crawling by the search engine server may be combined with collection by user browsers, so that the URLs to be detected are gathered from the internet more comprehensively. Because invalid parameters mostly occur in the addresses of dynamic web pages, only addresses that contain the file name of a dynamic file and the parameters it uses may be obtained and taken as the URL addresses to be detected.
S120: respectively extracting the URL fragment combinations from the obtained URLs of the plurality of webpage links;
after the URL address to be detected is obtained, a URL segment combination including the dynamic file name included in the URL address to be detected and the used corresponding parameter name may be extracted from the URL address to be detected. The process of extracting the URL segment combination from the URL address to be detected may be a process of extracting the dynamic file name and the used parameters from the URL address to be detected, and combining the dynamic file name and the parameters extracted from the URL to be detected into the URL segment combination.
Pages that have different URLs but the same main content mostly use dynamic web page technology, and the URL of such a page often includes the file name of the dynamically run program file and the parameters used by the program. The URL of a page may include only one parameter, or may include two or more parameters. The URL of a page using dynamic web page technology can be used as the URL to be detected, and the dynamic file name and one or more URL parameters in it are extracted and combined to form the URL fragment combination of the URL to be detected. For example, the URL to be detected acquired in step S110 is:
http://bbs.xxxxx.com.cn/viewthread.php?page=1&sid=yyy&tid=zzzz
Here viewthread.php is the dynamic file name contained in the URL to be detected, and the URL to be detected also contains the parameters sid and tid. The dynamic file name and the parameters contained in the URL to be detected can be extracted and combined to form the URL fragment combination corresponding to the URL to be detected. For example, the URL segment combination composed of the dynamic file name in the URL to be detected and the parameters may be:
viewthread.php+sid+tid
In practical computer applications, the validity or invalidity of the URL segment combinations can be marked with binary digits, for example binary 0 representing valid and binary 1 representing invalid.
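The extraction in step S120 can be sketched in Python with the standard `urllib.parse` module. This is only an illustration under stated assumptions: the key format (file name, `+`, comma-separated parameter names) follows the style used in Table 1, sorting the parameter names is an added assumption so that the same set of parameters always yields the same key, and the host `bbs.example.com` is a made-up example rather than a URL from the patent.

```python
import posixpath
from urllib.parse import urlsplit, parse_qsl


def extract_fragment_combination(url):
    """Return 'file+param1,param2,...' for a dynamic-page URL, or None if not applicable."""
    parts = urlsplit(url)
    file_name = posixpath.basename(parts.path)  # e.g. 'viewthread.php'
    # Keep parameter names only; values are ignored. Sorting is an assumption.
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    if not file_name or not names:
        return None  # no file name or no query parameters
    return file_name + "+" + ",".join(names)


# Hypothetical example URL
print(extract_fragment_combination("http://bbs.example.com/viewthread.php?tid=zzzz&sid=yyy"))
# prints: viewthread.php+sid,tid
```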
S130: and counting the occurrence frequency of each URL segment combination, and determining the URL segment combination with the occurrence frequency meeting the preset conditions as a target URL segment combination.
In the process of counting the occurrence frequency of each URL segment combination and determining the URL segment combination with the occurrence frequency meeting the preset conditions as the target URL segment combination, URLs which do not occur frequently or have low click rate or contain specific segment combinations can be filtered, so that an invalid segment combination list is generated only by the URLs which occur frequently or have high click rate or contain the specific segment combinations, and a dynamic program which has high universality and affects a large number of URLs and the corresponding URL segment combinations thereof are selected to establish the invalid combination list, so that the invalid segment combinations in the invalid segment combination list have universality and wider practicability. Specifically, the occurrence frequency of the URL segment combinations in the sample URL may be counted to obtain the influence surface of each URL segment combination; or counting the number of different internet positions corresponding to the same URL segment combination to obtain the universality of each URL segment combination. Therefore, the specific filtering can be implemented in various ways, and the following description specifically describes the process of filtering the URL segment combinations as samples.
Firstly, a URL segment combination can be extracted as a sample from the URL of each acquired web page link. The extraction process is similar to the process of extracting the URL segment combination from the URL to be detected in S120, and the extracted URL segment combination takes a form similar to:
dynamic file + parameter list;
such as forum.php+authorid,mod,page,tid,sid
Here forum.php is the file name of the dynamic file extracted from the URL, and authorid, mod, page, tid and sid are the names of the parameters extracted from the URL.
Assume that the URL segment combinations extracted from the URLs of the web page links, together with their statistics, are as shown in Table 1:
TABLE 1
Key                                     Val1    Val2
forum.php+authorid,mod,page,tid,sid     10000   100000000
memberlist.php+first_char,mode,sk,sd    2387    27729179
digest.php+authorid                     3       287
index.php+mulu,wenxueid                 19      348
The Key column lists the URL segment combinations extracted from the URLs. The Val1 column is the number of different internet locations corresponding to the same URL segment combination, and represents the universality of the URL segment combination. The Val2 column is the number of URLs in which a URL segment combination appears, and represents the influence surface of the URL segment combination. Each sample URL segment combination can be filtered according to Val1 and/or Val2: the filtering may use only Val1 or only Val2, or may use Val1 and Val2 together, so that both the influence surface and the universality of the URL segment combination are verified.
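The two statistics can be computed in a single pass over the collected URLs. The Python sketch below is illustrative only. The patent states that an internet location is determined by the network path in the URL; treating it as the host plus the directory part of the path is an assumption made here, and the function name is hypothetical.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def count_combinations(urls):
    """Return {key: (Val1, Val2)}: distinct internet locations and number of URLs per key."""
    locations = defaultdict(set)
    url_counts = defaultdict(int)
    for url in urls:
        parts = urlsplit(url)
        file_name = posixpath.basename(parts.path)
        names = sorted({n for n, _ in parse_qsl(parts.query, keep_blank_values=True)})
        if not file_name or not names:
            continue  # not a dynamic-page URL with parameters
        key = file_name + "+" + ",".join(names)
        # Assumption: location = host + directory of the path
        locations[key].add(parts.netloc + posixpath.dirname(parts.path))
        url_counts[key] += 1
    return {key: (len(locations[key]), url_counts[key]) for key in url_counts}
```

For example, three `forum.php` URLs spread over two hypothetical hosts would yield Val1 = 2 and Val2 = 3 for the corresponding key.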
Specifically, the number of URLs containing the same URL segment combination can be counted, the number is determined as the occurrence frequency of the URL segment combination, and the URL segment combinations whose occurrence frequency meets the preset condition are determined as target URL segment combinations. In this way, URLs that appear infrequently or have a low click rate, or URLs containing specific fragment combinations, can be filtered out, so that the invalid fragment combination list is generated only from URLs that appear frequently or have a high click rate, or that contain specific fragment combinations, and the invalid fragment combinations in the list have a larger influence surface. As in Table 1 above, assuming that the threshold set for the frequency is 1000, the URL segment combinations can be filtered according to whether their occurrence frequency Val2 meets the threshold:
digest.php+authorid and
index.php+mulu,wenxueid are filtered out.
Alternatively, the number of different internet locations corresponding to the same URL segment combination may be counted. This number indicates that the URL segment combination is used in different websites or in different sub-sites of the same website, and objectively reflects how often the URL segment combination occurs across different paths of the internet: the greater the number, the higher the occurrence frequency and the more general the invalid segment combinations extracted from it, and vice versa. As in table 1 above, assuming that the threshold set for this number is 200, the URL segment combinations can be filtered according to whether the occurrence frequency Val1 of each URL segment combination meets the threshold, so that the URL segment combinations:
php + authored, and,
php + mulu, wenxueid are filtered out.
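The two counting schemes described above can be sketched as follows. This is a minimal illustration only: the function names, the input format (one pair of internet location and URL segment combination per sample URL) and the use of "greater than or equal to" as the threshold test are assumptions, not details given in the description.

```python
from collections import defaultdict


def count_frequencies(records):
    """Count, per URL segment combination, (distinct locations, URLs).

    `records` is an iterable of (internet_location, combination) pairs, one
    per sample URL; both fields are plain strings in this sketch.
    """
    url_counts = defaultdict(int)
    locations = defaultdict(set)
    for location, combination in records:
        url_counts[combination] += 1
        locations[combination].add(location)
    return {c: (len(locations[c]), url_counts[c]) for c in url_counts}


def select_targets(frequencies, min_locations=0, min_urls=0):
    """Keep combinations whose two counts both reach their thresholds.

    Treating "meets the threshold" as >= is an assumption of this sketch.
    """
    return sorted(
        c for c, (n_locations, n_urls) in frequencies.items()
        if n_locations >= min_locations and n_urls >= min_urls
    )
```

Here the first element of each result tuple plays the role of Val1 (number of different internet locations) and the second the role of Val2 (number of URLs); passing only `min_urls` or only `min_locations` corresponds to using one of the two schemes on its own.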
In another implementation, the number of URLs containing the same URL segment combination and the number of different internet locations corresponding to the same URL segment combination may be used jointly to determine which URL segment combinations are taken as target URL segment combinations. Specifically, the number of URLs including the same URL segment combination may be counted and determined as the first occurrence frequency of the URL segment combination; the number of different internet locations corresponding to the same URL segment combination is counted and determined as the second occurrence frequency of the URL segment combination, where the internet location is determined by the network path in the URL; and the URL segment combinations whose first and second occurrence frequencies meet the preset conditions are then determined as target URL segment combinations. In doing so, a joint frequency of the URL segment combination may be calculated from the first occurrence frequency, the second occurrence frequency and their respective preset weights, and the URL segment combinations whose joint frequency meets the preset condition are determined as target URL segment combinations. For example, in table 1, thresholds may be set for Val1 and Val2 at the same time and the URL segment combinations filtered using both Val1 and Val2, so that a more effective result is obtained when the invalid parameters are subsequently determined.
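One possible reading of the joint frequency is a weighted sum of the two occurrence frequencies. The description does not state the formula, so the weighted sum, the weights, the threshold and the counts used below are all hypothetical and serve only as a sketch.

```python
def joint_frequency(url_count, location_count, url_weight, location_weight):
    """Weighted sum of the first (URL count) and second (location count)
    occurrence frequencies. The weighted-sum form is an assumption."""
    return url_weight * url_count + location_weight * location_count


def select_by_joint_frequency(stats, threshold, url_weight=1.0, location_weight=5.0):
    """Return the combinations whose joint frequency reaches `threshold`.

    `stats` maps a combination to (location_count, url_count).
    """
    selected = []
    for combination, (location_count, url_count) in stats.items():
        score = joint_frequency(url_count, location_count, url_weight, location_weight)
        if score >= threshold:
            selected.append(combination)
    return sorted(selected)
```

Giving the location count a larger weight than the URL count, as in the defaults here, favours combinations that are spread over many sites rather than concentrated on one large site; whether that is desirable depends on the application.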
Further, the result of judging the validity of each URL parameter in the target URL segment combination may be saved as an invalid segment combination list, and the validity of the URL parameters in a URL segment combination may then be judged according to the invalid segment combination list.
In practical applications, the invalid fragment combination list may be a collection that holds a certain number of invalid fragment combinations. An invalid fragment combination is a URL fragment combination that has been found, by some means, to contain invalid URL parameters; the URL fragment combination is stored together with the validity of its parameters. For example, if it is detected by some means that, in a URL fragment combination, the parameter sid is a valid parameter and the parameter tid is an invalid parameter, and the binary digit 0 represents valid while the binary digit 1 represents invalid, then the invalid fragment combination list can store the invalid fragment combination:
viewthread.php+sid(0)+tid(1)
That is, the invalid fragment combination list stores several invalid fragment combinations together with the validity information of each parameter in those combinations. After the URL segment combination is extracted from the URL address to be detected, whether each URL parameter in the extracted URL segment combination is valid can be determined according to the invalid segment combinations stored in the invalid segment combination list.
Different URL fragment combinations can be distinguished by the dynamic file names or URL parameters they contain, and different invalid segment combinations can likewise be distinguished by dynamic file names or URL parameters. If a URL segment combination has the same dynamic file name and parameter names as an invalid segment combination stored in the invalid segment combination list, the URL segment combination and the invalid segment combination can be considered to match, that is, to represent the same dynamic file on the internet. Therefore, when judging from the invalid fragment combination list whether the parameters in a URL fragment combination to be tested are valid, the list can be queried with the URL fragment combination to be tested to find out whether a matching invalid fragment combination exists; if one exists, the validity of the URL parameters in the URL fragment combination is determined according to the matching invalid fragment combination and the validity information of each of its URL parameters.
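The matching step described above can be sketched as a dictionary lookup keyed by file name and parameter names. The entry syntax follows the stored example given earlier, where 0 marks a valid parameter and 1 an invalid one; the function names and the choice of a sorted tuple of parameter names as the key are assumptions of this sketch.

```python
import re

# One "+"-separated parameter field such as "tid(1)".
PARAM_FIELD = re.compile(r"^(\w+)\(([01])\)$")


def parse_entry(entry):
    """Parse an entry such as 'viewthread.php+sid(0)+tid(1)'.

    Returns ((file_name, sorted_parameter_names), {name: is_invalid}).
    """
    file_name, *fields = entry.split("+")
    flags = {}
    for field in fields:
        match = PARAM_FIELD.match(field)
        if match is None:
            raise ValueError(f"malformed parameter field: {field!r}")
        flags[match.group(1)] = match.group(2) == "1"
    return (file_name, tuple(sorted(flags))), flags


def build_invalid_list(entries):
    """Build the lookup table from a list of textual entries."""
    return dict(parse_entry(entry) for entry in entries)


def lookup(file_name, parameter_names, invalid_list):
    """Return the validity flags of the matching entry, or None if no
    entry has the same file name and the same set of parameter names."""
    key = (file_name, tuple(sorted(set(parameter_names))))
    return invalid_list.get(key)
```

Sorting the parameter names makes the match independent of the order in which the parameters appear in the URL to be tested.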
The invalid fragment combination list may be established as follows. First, the URLs of a plurality of web page links are obtained; these URLs may be regarded as sample URLs, for example web page link URLs crawled by a search engine or collected by a browser. Next, URL segment combinations are extracted from each of the obtained URLs; the URL segment combinations extracted from the sample URLs can be understood as sample combinations to be judged. Then the scope of influence, or generality, of each combination is assessed from the sample URLs and the combinations are filtered, so that URL fragment combinations with a wide scope of influence and general use are selected; specifically, the occurrence frequency of each URL fragment combination is counted, and the URL fragment combinations whose occurrence frequency meets the preset condition are determined as target URL fragment combinations. Finally, for each target URL fragment combination, the validity of each URL parameter in the target URL fragment combination is judged based on the URLs containing that combination.
In this way, the validity of each parameter in the URL fragment combinations extracted from the sample URLs is obtained. The dynamic file name, the parameters and the validity information of the parameters can be saved as the invalid fragment combination list. The process of creating the invalid fragment combination list may thus be regarded as a process of extracting URL fragment combinations from a limited number of URL samples and determining whether each parameter is valid, thereby creating a judgment sample, namely the invalid parameter list.
It should be noted that, in general, if a URL includes one dynamic file name and at least two parameters, its URL segment combination can be distinguished from other URL segment combinations by the file name and the corresponding at least two parameters. It can therefore first be determined whether the URL includes the file name of a dynamic file and at least two corresponding parameters, and if it does, the file name of the dynamic file and the corresponding parameters are extracted as the URL segment combination of the URL. In this way the extracted URL fragment combinations can be distinguished from one another, which improves the effectiveness of invalid parameter detection.
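The extraction rule just described (one dynamic file name plus at least two parameters) might be sketched with Python's standard `urllib.parse` module as follows. Taking the last path component as the dynamic file name and representing the combination as a tuple are assumptions of this sketch, and the example host names are hypothetical.

```python
import posixpath
from urllib.parse import parse_qsl, urlsplit


def extract_combination(url, min_params=2):
    """Return (file_name, sorted_parameter_names) for `url`, or None when
    the URL has no file name or fewer than `min_params` parameters."""
    parts = urlsplit(url)
    # Assumption: the dynamic file name is the last component of the path.
    file_name = posixpath.basename(parts.path)
    names = sorted({name for name, _ in parse_qsl(parts.query, keep_blank_values=True)})
    if not file_name or len(names) < min_params:
        return None
    return file_name, tuple(names)
```

For instance, `extract_combination("http://bbs.example/forum.php?tid=5&mod=view&sid=abc")` yields the file name `forum.php` with the parameter names `mod`, `sid` and `tid`, while a URL carrying a single parameter yields `None` and is skipped.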
In addition, the number of URLs including a given target URL segment combination may be very large; if every URL including the combination were taken as a detection object when identifying its parameters, the workload would be very large. Therefore, when the validity of each parameter in the target URL segment combination is judged, a preset number of URLs can be extracted from the URLs including the target URL segment combination, and the validity of each parameter judged based on the extracted URLs. Furthermore, for each target URL segment combination, a preset number of URLs distributed at different internet locations may be extracted from the URLs including the combination, and the validity of each parameter in the target URL segment combination determined based on those URLs.
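A simple way to draw a preset number of URLs spread over different internet locations is to group the URLs by location and take them in round-robin order. The description does not prescribe a sampling strategy, so the round-robin scheme and the definition of a location as host plus directory path are assumptions of this sketch.

```python
import posixpath
from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlsplit


def sample_across_locations(urls, n):
    """Pick up to `n` URLs, visiting each internet location in turn so that
    as many different locations as possible are represented."""
    if n <= 0:
        return []
    by_location = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        location = (parts.netloc, posixpath.dirname(parts.path))
        by_location[location].append(url)
    picked = []
    # Each round takes at most one URL from every location.
    for one_round in zip_longest(*by_location.values()):
        for url in one_round:
            if url is None:
                continue
            picked.append(url)
            if len(picked) == n:
                return picked
    return picked
```

The result is deterministic for a given input order; a random shuffle within each location could be added where unbiased sampling matters.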
S140: For each target URL fragment combination, judge the validity of each URL parameter in the target URL fragment combination based on the URLs containing the target URL fragment combination.
When the validity of each parameter in the target URL segment combination is judged, for each target URL segment combination and for the URLs including it, the web page content before and after each parameter of the URL is removed is compared; if the web page content is the same before and after a certain parameter is removed, that parameter in the target URL segment combination is determined to be an invalid parameter. For example, for the URL segment combination in table 1:
forum.php+authorid,mod,page,tid,sid
A certain number of URL samples including this URL segment combination may be selected. For each URL sample, the page content corresponding to the URL is obtained as the first page content; then one of the URL parameters, such as the parameter authorid, is removed while the other parameters are retained, and the page content corresponding to the URL with that parameter removed is obtained as the second page content. The first page content and the second page content are compared: if they are not consistent, the removed URL parameter authorid is a valid parameter, and if they are consistent, the removed URL parameter authorid is an invalid parameter. Next, the parameter mod is removed while the parameters authorid, page, tid and sid are retained, the third page content corresponding to the URL with that parameter removed is obtained, and the validity of the parameter mod is determined from the comparison between the third page content and the first page content. After the validity of every parameter has been determined in this way, each URL parameter in the target URL segment combination, the validity information of the parameter, and the file name of the dynamic file in the website (forum.php in the combination above) may further be stored together in the invalid fragment combination list.
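The remove-and-compare procedure can be sketched as below. The page download is passed in as a function `fetch` so that the sketch stays self-contained; `fetch`, the function names, and the rule that a parameter counts as invalid only if the content is unchanged for every sample URL are assumptions, and a real system would probably compare normalized main content rather than raw page text.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def without_parameter(url, name):
    """Return `url` with every occurrence of query parameter `name` removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != name]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def judge_parameters(sample_urls, parameter_names, fetch):
    """Return {name: is_invalid} for each parameter name.

    `fetch(url)` must return the page content for `url`. A parameter is
    reported invalid when removing it changes no sampled page.
    """
    if not sample_urls:
        raise ValueError("at least one sample URL is required")
    # First page content of each sample, fetched once and reused.
    original = {url: fetch(url) for url in sample_urls}
    result = {}
    for name in parameter_names:
        result[name] = all(
            fetch(without_parameter(url, name)) == original[url]
            for url in sample_urls
        )
    return result
```

In use, `fetch` would wrap an HTTP client of the implementer's choice; for testing it can be replaced by a stub that derives the content from the parameters that actually matter.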
The method for identifying invalid parameters in a uniform resource locator (URL) provided by the embodiment of the present invention has been described in detail above. With this method, the URLs of a plurality of web page links are obtained; URL fragment combinations are extracted from each of the obtained URLs; the occurrence frequency of each URL segment combination is counted, and the URL segment combinations whose occurrence frequency meets the preset condition are determined as target URL segment combinations; and, for each target URL fragment combination, the validity of each URL parameter in the target URL fragment combination is judged based on the URLs containing that combination. Because identical URL segment combinations that meet the conditions are selected from the obtained URLs by statistics and filtering and are judged once, the method addresses the low efficiency of the traditional approach, in which a search engine identifying invalid parameters in repeated links has to examine the parameters of every collected link, exhaust all the possibilities of each parameter and judge them one by one. The method thus identifies invalid parameters in links quickly and improves the efficiency of identifying repeated links. When a search engine captures web page content, the method can first be applied to identify the invalid parameters in a website, so that whether a web page of the website is the same as another page that has already been captured can be judged quickly, which improves the information capturing efficiency of the search engine.
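One way a crawler might use the resulting list to spot repeated links is to strip the parameters marked invalid and compare the remaining URLs. The description does not spell out this step, so the sketch below is an illustration only; the table layout, the function name and the example entry (with sid assumed invalid) are hypothetical.

```python
import posixpath
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url, invalid_list):
    """Drop the parameters marked invalid for the URL's fragment combination.

    `invalid_list` maps (file_name, sorted_parameter_names) to
    {parameter_name: is_invalid}. URLs without a matching entry are
    returned unchanged.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    key = (posixpath.basename(parts.path), tuple(sorted({k for k, _ in pairs})))
    flags = invalid_list.get(key)
    if flags is None:
        return url
    kept = [(k, v) for k, v in pairs if not flags.get(k, False)]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical entry: for forum.php with parameters sid and tid, sid is invalid.
EXAMPLE_LIST = {("forum.php", ("sid", "tid")): {"sid": True, "tid": False}}
```

Two links that differ only in an invalid parameter then map to the same canonical URL and need to be fetched only once.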
Corresponding to the method for identifying invalid parameters in a uniform resource locator URL provided in the embodiment of the present invention, an apparatus for identifying invalid parameters in a uniform resource locator URL is also provided, and referring to fig. 2, the apparatus may include:
a URL obtaining unit 310 for obtaining URLs of a plurality of web page links;
a URL segment combination extracting unit 220, configured to extract URL segment combinations from URLs of the acquired multiple web page links, respectively;
a counting unit 320, configured to count occurrence frequencies of the URL segment combinations, and determine a URL segment combination whose occurrence frequency meets a preset condition as a target URL segment combination; and
a validity determining unit 330, adapted to determine, for each target URL segment combination, the validity of each URL parameter in the target URL segment combination based on the URLs including the target URL segment combination.
The apparatus may further include: a saving unit, configured to save, as an invalid segment combination list, the result obtained by the validity determining unit 330 in judging the validity of each URL parameter in the target URL segment combination; and a to-be-detected URL extracting unit 210, configured to obtain the URL address to be detected corresponding to the web link to be detected.
When an invalid parameter in a URL address is to be identified, the URL address to be detected corresponding to the web link to be detected may first be obtained by the to-be-detected URL extracting unit 210. The URL address to be detected may be captured by a search engine server and transmitted to the to-be-detected URL extracting unit 210, or the to-be-detected URL extracting unit 210 may itself capture the URL to be detected on the internet.
A URL fragment combination extracting unit 220, coupled to the to-be-detected URL extracting unit 210, is configured to extract a URL fragment combination from the URL address to be detected.
Specifically, the URL segment combination extracting unit 220 extracts, from the URL address to be detected, the file name of the dynamic file included in the address and the corresponding URL parameters, and combines the extracted file name and URL parameters into the URL segment combination. Pages that have different URLs but the same main content mostly use dynamic web page technology, and the URL of such a page often includes the file name of the dynamically executed program file and the parameters used by the program. Therefore, the URL segment combination extracting unit 220 can extract the file name of the dynamic file in the URL and the corresponding URL parameters, and combine the extracted dynamic file name and parameters into a URL segment combination. The dynamic file name and the parameters in a URL fragment combination correspond to each other.
The URL parameter detecting unit 230, coupled to the URL segment combination extracting unit 220, determines validity of URL parameters in the URL segment combination according to the invalid segment combination list.
By the URL parameter detecting unit 230, the validity of the URL parameter in the URL segment combination can be determined according to the invalid segment combination list. In the invalid fragment combination list, the invalid fragment combination, the dynamic file name in the combination and the validity information of each corresponding URL parameter can be stored.
In practical applications, the invalid segment combination list may store a certain number of invalid segment combinations, and the URL parameter detecting unit 230 may query the invalid segment combination list with the URL segment combination of the URL to be detected to find out whether a matching invalid segment combination exists in the list; if one exists, the validity of the URL parameters in the URL fragment combination is judged according to the matching invalid fragment combination and the validity information of each of its URL parameters.
The invalid fragment combination list used in this process may be built as follows. As shown in fig. 3, the URL obtaining unit 310 may obtain the URLs of a plurality of web page links and output each obtained URL to the URL segment combination extracting unit 220; the URL segment combination extracting unit 220 extracts a URL segment combination from each of the obtained URLs; the counting unit 320 counts the occurrence frequency of each URL segment combination and determines the URL segment combinations whose occurrence frequency meets the preset condition as target URL segment combinations; and the validity determining unit 330, coupled to the counting unit 320, determines, for each target URL segment combination, the validity of each URL parameter in the target URL segment combination based on the URLs including that combination. The process in which the URL obtaining unit 310 obtains a plurality of web page links and the invalid fragment combination list is created may be regarded as a process of extracting URL fragment combinations from a limited number of URLs and determining whether each parameter is valid, thereby creating a judgment sample. Through the counting unit 320, URLs that occur infrequently or have a low click rate, or URLs containing specific fragment combinations, can be filtered out, so that the invalid fragment combination list is generated only from URLs that occur frequently or have a high click rate, or from URLs containing specific fragment combinations, which makes the invalid fragment combinations in the list more general and more widely applicable. The counting unit 320 may specifically include a first counting subunit 410 as shown in fig. 4, or a second counting subunit 420 as shown in fig. 5.
As shown in fig. 4, the counting unit 320 may include a first counting subunit 410, which counts the number of URLs containing the same URL segment combination, determines the number as the occurrence frequency of the URL segment combination, and determines the URL segment combinations whose occurrence frequency meets the preset condition as target URL segment combinations. Where a certain number of URLs has been collected for generating the invalid segment combination list, the number of URLs containing the same URL segment combination counted by the first counting subunit 410 can be understood as follows: the larger the number, the higher the occurrence frequency of the URL segment combination on the internet and the wider the applicability of the invalid segment combinations extracted from it, and vice versa.
Or,
as shown in fig. 5, the counting unit 320 may include a second counting subunit 420, counting the number of different internet locations corresponding to the same URL segment combination through the second counting subunit 420, determining the number as the occurrence frequency of the URL segment combination, and determining the URL segment combination whose occurrence frequency meets the preset condition as the target URL segment combination; wherein the internet location is determined by a network path in the URL. The number of different internet positions corresponding to the same URL segment combination can represent that the URL segment combination is used in different websites or different sub-sites of the same website, and objectively reflects the occurrence frequency of the URL segment combination in the internet, the more the number of the URL segment combination, the higher the occurrence frequency, the more the invalid segment combination extracted from the URL segment combination has wide practicability, and vice versa.
When the validity determining unit 330 determines the validity of each URL parameter in a target URL segment combination, it may do so based on a URL including the target URL segment combination. Specifically, it obtains the corresponding first page content according to the URL, then removes each URL parameter (and its value) from the URL in turn, obtains the corresponding second page content after each removal, and compares the first page content with the second page content; if they are inconsistent, the removed URL parameter is a valid parameter, and if they are consistent, the removed URL parameter is an invalid parameter. Furthermore, the URL parameters in the target URL segment combination and the validity information of the parameters may be stored in the invalid segment combination list in correspondence with the file name viewread.
In another implementation, the number of URLs of the same URL segment combination and the number of different internet locations corresponding to the same URL segment combination may be used to jointly determine which URL segment combinations may be used as target URL segment combinations. As shown in fig. 6, the counting unit 320 may include a third counting sub-unit 610, counting the number of URLs containing the same URL segment combination, and determining the number as the first occurrence frequency of the URL segment combination; a fourth statistics subunit 620, configured to count the number of different internet locations corresponding to the same URL segment combination, and determine the number as a second occurrence frequency of the URL segment combination; wherein the internet location is determined by the network path in the URL; and a determining subunit 630, configured to determine, as the target URL segment combination, the URL segment combination whose first frequency of occurrence and second frequency of occurrence meet the preset condition. The determining subunit 630 may include a joint frequency calculating subunit, which calculates a joint frequency of the URL segment combination according to the first frequency of occurrence, the second frequency of occurrence, and respective preset weights, and determines, as the target URL segment combination, the URL segment combination whose joint frequency meets a preset condition.
As shown in fig. 7, the validity determining unit 330 may further include a sampling unit 710, configured to extract a preset number of URLs from the URLs including the target URL segment combination, and a validity determining subunit 720, adapted to determine the validity of each parameter in the target URL segment combination based on the preset number of URLs extracted by the sampling unit. The number of URLs including a given target URL segment combination may be very large; if every URL including the combination were taken as a detection object when identifying its parameters, the workload would be very large. In this case, the sampling unit 710 may extract a preset number of URLs from the URLs including the target URL segment combination as the URLs to be used for detection, and the validity determining subunit 720 may then determine the validity of each parameter in the target URL segment combination based on those URLs, which improves the efficiency of detecting the individual parameters in the target URL segment combination. Furthermore, for each target URL segment combination, a preset number of URLs distributed at different internet locations may be extracted from the URLs including the combination, and the validity of each parameter in the target URL segment combination determined based on the extracted URLs. Extracting URLs distributed at different internet locations widens the coverage of the detected URLs and therefore improves the accuracy of the parameter validity judgment made from them.
In practical applications, the URL segment combination extracting unit 220 may first determine the number of parameters corresponding to the dynamic file name included in the URL. In general, if a URL includes one dynamic file name and at least two parameters, its URL segment combination can be distinguished from other URL segment combinations by the file name and the corresponding at least two parameters; the URL segment combination extracting unit 220 may therefore first determine whether the URL includes the file name of a dynamic file and at least two corresponding parameters and, if so, extract the file name of the dynamic file and the corresponding parameters as the URL segment combination of the URL. In this way the extracted URL fragment combinations can be distinguished from one another, which improves the effectiveness of the equipment.
The equipment for identifying invalid parameters in a URL according to the embodiment of the present invention has been described in detail above. With this equipment, identical URL segment combinations that meet the conditions are selected from the obtained URLs of a plurality of web page links by statistics and filtering and are judged once, which addresses the low efficiency of the traditional approach, in which a search engine identifying invalid parameters in repeated links has to examine the parameters of every collected link, exhaust all the possibilities of each parameter and judge them one by one. The equipment thus identifies invalid parameters in links quickly and improves the efficiency of identifying repeated links.
To better understand the apparatus and method for identifying invalid parameters in a URL address according to an embodiment of the present invention, a specific application example of the embodiment of the present invention is given below, and please refer to fig. 8, which is a schematic diagram of an application example of the embodiment of the present invention.
Firstly, parameter combinations corresponding to 'website + path + dynamic file' and URL number influenced by each parameter combination (namely URL fragment combination) are counted, namely the occurrence frequency of the URL fragment combination; and meanwhile, counting different website + path numbers corresponding to the dynamic file + parameter combination to verify the universality of the URL segment combination.
The dynamic file + parameter combinations are then sorted in descending order of generality and of the number of affected URLs, and the combinations whose number of affected URLs and generality both exceed certain thresholds are taken. This corresponds to the process of determining the URL segment combinations whose occurrence frequency meets the preset condition as target URL segment combinations, that is, the target URL segment combinations are determined jointly from the number of URLs containing the same URL segment combination and the number of different internet locations corresponding to the same URL segment combination. For each combination, n URLs distributed over different websites and paths are taken (for example, n is a positive integer greater than 10); the thresholds and the number of URLs are determined dynamically according to the detection capability of the system and the expected benefit. For each dynamic file + parameter combination, the change in web page content before and after each parameter is removed from the n URLs is detected: if the content of the n URLs does not change, the parameter is an invalid parameter; otherwise, the parameter is a valid parameter. Further, the result of judging the invalid parameters of the URL segment combinations in the sample URLs may be saved as a rule of the form "dynamic file + parameter list: invalid parameters", and the rules may be stored together to form the invalid fragment combination list.
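The rule form "dynamic file + parameter list: invalid parameters" could be written out and read back as plain text, for instance as follows. The exact separators ("+", "," and ": ") and the function names are assumptions of this sketch, and which parameters are marked invalid in the example is hypothetical.

```python
def format_rule(file_name, parameters, invalid):
    """Render one rule as 'file+p1,p2,...: invalid1,invalid2'."""
    unknown = set(invalid) - set(parameters)
    if unknown:
        raise ValueError(f"invalid parameters not in parameter list: {sorted(unknown)}")
    # Keep the invalid parameters in the same order as the parameter list.
    invalid_text = ",".join(p for p in parameters if p in invalid)
    return "{}+{}: {}".format(file_name, ",".join(parameters), invalid_text)


def parse_rule(line):
    """Inverse of format_rule: return (file_name, parameters, invalid_set)."""
    head, _, tail = line.partition(":")
    file_name, _, parameter_text = head.partition("+")
    parameters = [p.strip() for p in parameter_text.split(",") if p.strip()]
    invalid = {p.strip() for p in tail.split(",") if p.strip()}
    return file_name.strip(), parameters, invalid
```

For example, `format_rule("forum.php", ["authorid", "mod", "page", "tid", "sid"], {"sid"})` produces the line `forum.php+authorid,mod,page,tid,sid: sid`, which `parse_rule` turns back into its three parts.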
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the identification device of invalid parameters in a uniform resource locator URL according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering. These words may be interpreted as names.
The application is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the computer system/server include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, networked personal computers, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above, and the like.
The computer system/server may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.

Claims (10)

1. An apparatus for identifying invalid parameters in a Uniform Resource Locator (URL), comprising:
a URL acquisition unit adapted to acquire the URLs of a plurality of webpage links;
a URL fragment combination extraction unit adapted to extract a URL fragment combination from each of the acquired URLs of the webpage links;
a statistical unit adapted to count the occurrence frequency of each URL fragment combination and to determine each URL fragment combination whose occurrence frequency meets a preset condition as a target URL fragment combination;
and a validity judging unit adapted to judge, for each target URL fragment combination, the validity of each URL parameter in that target URL fragment combination based on the URLs containing the target URL fragment combination.
2. The apparatus of claim 1, further comprising:
a storage unit adapted to store, as an invalid fragment combination list, the result of the validity judging unit judging the validity of each URL parameter in the target URL fragment combination;
a to-be-detected URL extraction unit adapted to acquire a URL address to be detected corresponding to a webpage link to be detected;
the URL fragment combination extraction unit being further adapted to extract a URL fragment combination from the URL address to be detected;
and a URL parameter detection unit adapted to judge the validity of the URL parameters in the URL fragment combination according to the invalid fragment combination list.
3. The apparatus of claim 1 or 2, wherein the URL fragment combination extraction unit is adapted to:
extract the file name of a dynamic file and the corresponding URL parameters from the URL address to be detected, and combine the extracted file name and the corresponding URL parameters to form the URL fragment combination.
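The extraction described in claim 3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `extract_fragment_combination` is hypothetical, the example URL is invented, and the sketch assumes that a "URL fragment combination" is the dynamic file name paired with the sorted set of parameter names (the claim does not say whether parameter values are also part of the combination).

```python
import posixpath
from urllib.parse import parse_qsl, urlsplit


def extract_fragment_combination(url):
    """Return (dynamic file name, sorted tuple of parameter names) for a URL.

    Assumption: parameter values are ignored, so URLs that differ only in
    their values map to the same fragment combination.
    """
    parts = urlsplit(url)
    # Last path component, e.g. "viewthread.php" for "/forum/viewthread.php".
    file_name = posixpath.basename(parts.path)
    # keep_blank_values=True so that parameters such as "sid=" are not dropped.
    names = {name for name, _ in parse_qsl(parts.query, keep_blank_values=True)}
    return file_name, tuple(sorted(names))


combo = extract_fragment_combination(
    "http://example.com/forum/viewthread.php?tid=12&sid=abc"
)
```

Sorting the parameter names makes the combination independent of the order in which the parameters appear in the query string, which is a design choice of this sketch rather than something the claim requires.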
4. The apparatus according to any one of claims 1 to 3, wherein the invalid fragment combination list stores invalid fragment combinations and validity information of URL parameters in the combinations.
5. The apparatus according to any one of claims 1 to 4, wherein the URL parameter detection unit is adapted to: query the invalid fragment combination list with the URL fragment combination to determine whether a matching invalid fragment combination exists in the invalid fragment combination list;
and, if a matching invalid fragment combination exists, judge the validity of the URL parameters in the URL fragment combination according to the matching invalid fragment combination and the validity information of each URL parameter in that combination.
6. A method for identifying invalid parameters in a Uniform Resource Locator (URL), comprising the following steps:
acquiring the URLs of a plurality of webpage links;
extracting a URL fragment combination from each of the acquired URLs of the webpage links;
counting the occurrence frequency of each URL fragment combination, and determining each URL fragment combination whose occurrence frequency meets a preset condition as a target URL fragment combination;
and, for each target URL fragment combination, judging the validity of each URL parameter in that target URL fragment combination based on the URLs containing the target URL fragment combination.
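The counting and selection steps of claim 6 can be sketched as follows. This is a hedged illustration only: the helper names, the example URLs, and the threshold `MIN_OCCURRENCES` are all hypothetical, and the "preset condition" is assumed here to be a simple minimum count. The final validity judgment is deliberately left out, because the claim does not say how it is performed; the sketch only gathers the URLs on which such a judgment would be based.

```python
import posixpath
from collections import defaultdict
from urllib.parse import parse_qsl, urlsplit

MIN_OCCURRENCES = 3  # hypothetical stand-in for the "preset condition"


def fragment_combination(url):
    """Assumed combination: (file name, sorted tuple of parameter names)."""
    parts = urlsplit(url)
    names = {name for name, _ in parse_qsl(parts.query, keep_blank_values=True)}
    return posixpath.basename(parts.path), tuple(sorted(names))


def target_combinations(urls, min_occurrences=MIN_OCCURRENCES):
    """Group URLs by fragment combination and keep the frequent groups."""
    groups = defaultdict(list)
    for url in urls:
        groups[fragment_combination(url)].append(url)
    # The occurrence frequency of a combination is the size of its group.
    return {
        combo: members
        for combo, members in groups.items()
        if len(members) >= min_occurrences
    }


crawled = [
    "http://example.com/viewthread.php?tid=1&sid=aaa",
    "http://example.com/viewthread.php?tid=2&sid=bbb",
    "http://example.com/viewthread.php?sid=ccc&tid=3",
    "http://example.com/index.php?page=1",
]
targets = target_combinations(crawled)
```

Keeping the member URLs alongside each target combination means a later step can examine exactly the URLs that contain the combination, as the last step of the claim requires.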
7. The method of claim 6, further comprising:
storing, as an invalid fragment combination list, the result of judging the validity of each URL parameter in the target URL fragment combination;
acquiring a URL address to be detected corresponding to a webpage link to be detected;
extracting a URL fragment combination from the URL address to be detected;
and judging the validity of the URL parameter in the URL fragment combination according to the invalid fragment combination list.
8. The method according to claim 6 or 7, wherein extracting the URL fragment combination from the URL address to be detected comprises:
extracting the file name of a dynamic file and the corresponding URL parameters from the URL address to be detected, and combining the extracted file name and the corresponding URL parameters to form the URL fragment combination;
and wherein the invalid fragment combination list stores invalid fragment combinations and validity information of each parameter in the combinations.
9. The method according to any one of claims 6 to 8, wherein the invalid fragment combination list stores invalid fragment combinations and validity information of each URL parameter in the combinations.
10. The method according to any one of claims 6 to 9, wherein judging the validity of the URL parameters in the URL fragment combination according to the invalid fragment combination list comprises:
querying the invalid fragment combination list with the URL fragment combination to determine whether a matching invalid fragment combination exists in the invalid fragment combination list;
and, if a matching invalid fragment combination exists, judging the validity of the URL parameters in the URL fragment combination according to the matching invalid fragment combination and the validity information of each URL parameter in that combination.
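The lookup described in claims 9 and 10 can be sketched in Python as a dictionary query. Everything named here is an assumption made for illustration: the table `INVALID_COMBINATIONS`, its single entry, the function `judge_parameters`, and the choice to key the list by (file name, sorted parameter names). The claims only state that the list stores invalid fragment combinations together with validity information for each URL parameter.

```python
import posixpath
from urllib.parse import parse_qsl, urlsplit

# Hypothetical invalid fragment combination list:
# fragment combination -> validity of each URL parameter (False = invalid).
INVALID_COMBINATIONS = {
    ("viewthread.php", ("sid", "tid")): {"sid": False, "tid": True},
}


def judge_parameters(url, invalid_list=INVALID_COMBINATIONS):
    """Return {parameter name: is_valid} if the URL's fragment combination
    matches an entry in the list, otherwise None."""
    parts = urlsplit(url)
    names = {name for name, _ in parse_qsl(parts.query, keep_blank_values=True)}
    combo = (posixpath.basename(parts.path), tuple(sorted(names)))
    match = invalid_list.get(combo)
    if match is None:
        return None  # no matching invalid fragment combination
    return dict(match)  # copy so callers cannot mutate the stored list


verdict = judge_parameters("http://example.com/viewthread.php?tid=5&sid=zz")
```

A caller could then drop the parameters judged invalid before comparing or storing the URL; that follow-up use is not stated in the claims and is mentioned only as one possible application.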
CN201310462262.5A 2013-09-30 2013-09-30 Equipment and method for identifying invalid parameters in URLs Active CN103530336B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201310462262.5A CN103530336B (en) 2013-09-30 2013-09-30 The identification equipment and method of Invalid parameter in uniform resource position mark URL
PCT/CN2014/083216 WO2015043308A1 (en) 2013-09-30 2014-07-29 Device for identifying invalid parameters in url, and device and method for identifying invalid parameters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310462262.5A CN103530336B (en) 2013-09-30 2013-09-30 Equipment and method for identifying invalid parameters in URLs

Publications (2)

Publication Number Publication Date
CN103530336A true CN103530336A (en) 2014-01-22
CN103530336B CN103530336B (en) 2017-09-15

Family

ID=49932345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310462262.5A Active CN103530336B (en) 2013-09-30 2013-09-30 Equipment and method for identifying invalid parameters in URLs

Country Status (1)

Country Link
CN (1) CN103530336B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015043308A1 (en) * 2013-09-30 2015-04-02 北京奇虎科技有限公司 Device for identifying invalid parameters in url, and device and method for identifying invalid parameters
CN105095463A (en) * 2015-07-30 2015-11-25 北京奇虎科技有限公司 Method, device and system for patrolling material link addresses
CN105630983A (en) * 2015-12-28 2016-06-01 努比亚技术有限公司 Resource obtaining and optimizing device and method
CN105787038A (en) * 2016-02-25 2016-07-20 北京搜狗科技发展有限公司 Method and electronic equipment for exploring transformation rule of uniform resource locators
CN106095979A (en) * 2016-06-20 2016-11-09 百度在线网络技术(北京)有限公司 URL merging processing method and device
CN107368399A (en) * 2017-06-28 2017-11-21 武汉斗鱼网络科技有限公司 Online webpage monitoring method and system
CN108388796A (en) * 2018-02-24 2018-08-10 深圳壹账通智能科技有限公司 Dynamic domain name verification method, system, computer equipment and storage medium
CN109359250A (en) * 2018-08-31 2019-02-19 阿里巴巴集团控股有限公司 Uniform resource locator processing method, device, server and readable storage medium
CN109670123A (en) * 2018-12-28 2019-04-23 杭州迪普科技股份有限公司 Data processing method and apparatus
CN113434790A (en) * 2021-06-16 2021-09-24 北京百度网讯科技有限公司 Method and device for identifying repeated links and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080172738A1 (en) * 2007-01-11 2008-07-17 Cary Lee Bates Method for Detecting and Remediating Misleading Hyperlinks
CN101702179A (en) * 2009-12-01 2010-05-05 百度在线网络技术(北京)有限公司 Method and device for removing duplication from data mining
CN103020266A (en) * 2012-12-25 2013-04-03 北京奇虎科技有限公司 Method and device for extracting webpage text content
CN103077250A (en) * 2013-01-28 2013-05-01 人民搜索网络股份公司 Method and device for capturing webpage content
US8533206B1 (en) * 2008-01-11 2013-09-10 Google Inc. Filtering in search engines

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015043308A1 (en) * 2013-09-30 2015-04-02 北京奇虎科技有限公司 Device for identifying invalid parameters in url, and device and method for identifying invalid parameters
CN105095463B (en) * 2015-07-30 2018-09-11 北京奇虎科技有限公司 Method, device and system for patrolling material link addresses
CN105095463A (en) * 2015-07-30 2015-11-25 北京奇虎科技有限公司 Method, device and system for patrolling material link addresses
CN105630983A (en) * 2015-12-28 2016-06-01 努比亚技术有限公司 Resource obtaining and optimizing device and method
CN105787038A (en) * 2016-02-25 2016-07-20 北京搜狗科技发展有限公司 Method and electronic equipment for exploring transformation rule of uniform resource locators
CN105787038B (en) * 2016-02-25 2019-04-30 北京搜狗科技发展有限公司 Method and electronic equipment for exploring transformation rule of uniform resource locators
CN106095979B (en) * 2016-06-20 2020-05-08 百度在线网络技术(北京)有限公司 URL merging processing method and device
CN106095979A (en) * 2016-06-20 2016-11-09 百度在线网络技术(北京)有限公司 URL merging processing method and device
CN107368399A (en) * 2017-06-28 2017-11-21 武汉斗鱼网络科技有限公司 Online webpage monitoring method and system
CN108388796A (en) * 2018-02-24 2018-08-10 深圳壹账通智能科技有限公司 Dynamic domain name verification method, system, computer equipment and storage medium
WO2019161658A1 (en) * 2018-02-24 2019-08-29 深圳壹账通智能科技有限公司 Dynamic domain name validation method and system, and computer device and storage medium
CN109359250A (en) * 2018-08-31 2019-02-19 阿里巴巴集团控股有限公司 Uniform resource locator processing method, device, server and readable storage medium
CN109359250B (en) * 2018-08-31 2022-05-31 创新先进技术有限公司 Uniform resource locator processing method, device, server and readable storage medium
CN109670123A (en) * 2018-12-28 2019-04-23 杭州迪普科技股份有限公司 Data processing method and apparatus
CN113434790A (en) * 2021-06-16 2021-09-24 北京百度网讯科技有限公司 Method and device for identifying repeated links and electronic equipment
CN113434790B (en) * 2021-06-16 2023-07-25 北京百度网讯科技有限公司 Method and device for identifying repeated links and electronic equipment

Also Published As

Publication number Publication date
CN103530336B (en) 2017-09-15

Similar Documents

Publication Publication Date Title
CN103530336B (en) Equipment and method for identifying invalid parameters in URLs
CN103530337B (en) Device and method for identifying invalid parameters in URLs
Zhao Web scraping
CN101655868B (en) Network data mining method, network data transmitting method and equipment
CN104935605B (en) The detection method of fishing website, apparatus and system
CN106095979B (en) URL merging processing method and device
CN104699704B (en) Content pushing and receiving method, device and system
CN107145779B (en) Method and device for identifying offline malicious software log
CN102710795B (en) Hotspot collecting method and device
WO2014187120A1 (en) Method for detecting brand counterfeit websites based on webpage icon matching
US20140325596A1 (en) Authentication of ip source addresses
CN108768921B (en) Malicious webpage discovery method and system based on feature detection
CN107294919A (en) A kind of detection method and device of horizontal authority leak
CN102984161B (en) The recognition methods of a kind of reliable website and device
CN105224691A (en) A kind of information processing method and device
CN110020161B (en) Data processing method, log processing method and terminal
CN105187439A (en) Phishing website detection method and device
CN107944001A (en) Hot news detection method and device and electronic equipment
CN106202297A (en) Identify the method and device of user interest
CN113453076B (en) User video service quality evaluation method, device, computing equipment and storage medium
CN108171053B (en) Rule discovery method and system
CN111125704A (en) Webpage Trojan horse recognition method and system
CN114760216B (en) Method and device for determining scanning detection event and electronic equipment
CN106411879B (en) A kind of acquisition methods and device of software identification feature
CN104021143A (en) Method and device for recording webpage access behavior

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220711

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi Software (Beijing) Co., Ltd.