CN109359250A - Uniform resource locator processing method, apparatus, server, and readable storage medium - Google Patents


Info

Publication number
CN109359250A
Authority
CN
China
Prior art keywords
url
section
target
field
dynamic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811014600.8A
Other languages
Chinese (zh)
Other versions
CN109359250B (en)
Inventor
刘宇江
张园超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201811014600.8A
Publication of CN109359250A
Application granted
Publication of CN109359250B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Embodiments of this specification provide a uniform resource locator (URL) processing method. First, among target URLs that share the same request method, the same domain name, and the same depth, the number of distinct field contents at each segment level is counted and used to identify the target dynamic segments in the path portion of the corresponding target URLs. Then, the URL data to be processed is deduplicated based on the identified target dynamic segments. This helps prevent the target dynamic segments contained in the path portion from affecting the subsequent deduplication, thereby improving the accuracy of URL deduplication.

Description

Uniform resource locator processing method, apparatus, server, and readable storage medium
Technical field
Embodiments of this specification relate to the technical field of data processing, and in particular to a uniform resource locator processing method, apparatus, server, and readable storage medium.
Background
To better protect web systems from external attacks, security scanning and vulnerability detection need to be performed on them. When a web system is scanned, the scan target addresses need to be deduplicated first, so that duplicate targets are not scanned repeatedly and scanning efficiency is improved. A scheme for deduplicating scan target addresses is therefore needed.
Summary of the invention
Embodiments of this specification provide a uniform resource locator processing method, apparatus, server, and readable storage medium.
In a first aspect, embodiments of this specification provide a uniform resource locator processing method, comprising: splitting the path portion of each target URL in the uniform resource locator (URL) data to be processed into one or more fields, and determining the segment level of each field and the depth of the target URL, where the segment level is the order in which each field appears in the target URL and the depth is the number of fields in the target URL; obtaining, for target URLs with identical preset features, the field content corresponding to each segment level, where the preset features include the request method, the domain name, and the depth corresponding to the target URL; counting, among the target URLs with identical preset features, the number of distinct field contents corresponding to the same segment level, and, when the statistical result satisfies a preset condition, determining the target dynamic segment of the path portion of the corresponding target URLs based on the fields corresponding to that segment level; and deduplicating the URL data to be processed according to the target dynamic segment.
In a second aspect, embodiments of this specification provide a uniform resource locator processing apparatus, comprising a splitting module, an obtaining module, a determining module, and a deduplication module. The splitting module is configured to split the path portion of each target URL in the uniform resource locator (URL) data to be processed into one or more fields, and to determine the segment level of each field and the depth of the target URL, where the segment level is the order in which each field appears in the target URL and the depth is the number of fields in the target URL. The obtaining module is configured to obtain, for target URLs with identical preset features, the field content corresponding to each segment level, where the preset features include the request method, the domain name, and the depth corresponding to the target URL. The determining module is configured to count, among the target URLs with identical preset features, the number of distinct field contents corresponding to the same segment level, and, when the statistical result satisfies a preset condition, to determine the target dynamic segment of the path portion of the corresponding target URLs based on the fields corresponding to that segment level. The deduplication module is configured to deduplicate the URL data to be processed according to the target dynamic segment.
In a third aspect, embodiments of this specification provide a server, comprising: a memory; one or more processors; and the above uniform resource locator processing apparatus, which is stored in the memory and configured to be executed by the one or more processors.
In a fourth aspect, embodiments of this specification provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the above uniform resource locator processing method.
The beneficial effects of the embodiments of this specification are as follows:
In the uniform resource locator processing method provided by the embodiments of this specification, the path portion of each target URL is split into one or more fields, and the segment level of each field and the depth of the target URL are determined. Then, among target URLs with the same request method, the same domain name, and the same depth, the number of distinct field contents corresponding to the same segment level is counted; when the statistical result satisfies a preset condition, the target dynamic segment of the path portion of the corresponding target URLs is determined based on the fields corresponding to that segment level. Finally, the URL data to be processed is deduplicated based on the identified target dynamic segments. This helps prevent the target dynamic segments contained in the path portion from affecting the subsequent deduplication, thereby improving the accuracy of URL deduplication.
Brief description of the drawings
Fig. 1 is a schematic diagram of an operating environment of an embodiment of this specification;
Fig. 2 is a flowchart of the uniform resource locator processing method provided in the first aspect of the embodiments of this specification;
Fig. 3 is a schematic diagram of the tree structure of an exemplary target URL list provided in the first aspect of the embodiments of this specification;
Fig. 4 is a flowchart of a pre-generalization step provided in the first aspect of the embodiments of this specification;
Fig. 5 is a structural diagram of the uniform resource locator processing apparatus provided in the second aspect of the embodiments of this specification;
Fig. 6 is a schematic diagram of the server structure provided in the third aspect of the embodiments of this specification.
Detailed description
For a better understanding of the above technical solutions, the technical solutions of the embodiments of this specification are described in detail below with reference to the drawings and specific embodiments. It should be understood that the embodiments of this specification and the specific features in them are detailed descriptions of the technical solutions of this specification rather than limitations on them, and that, where no conflict arises, the embodiments and the technical features in them may be combined with one another.
In the embodiments of this specification, a URL (Uniform Resource Locator), also called a web address, is the address of a standard resource on the Internet and consists of parts such as the protocol, domain name, path (PATH), and parameters. The path portion follows the domain name portion of the URL, is separated by "/", and is used to locate a file or function code; the parameter portion follows the path portion of the URL and is separated from it by "?". Dynamic URL: a URL in which part of the path portion is a parameter value. Generalization: converting the parameter segments in URLs of the same function into a preset unified identifier.
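To make the definition of generalization concrete, the following minimal Python sketch replaces the fields at given segment levels with a unified identifier and then deduplicates the resulting paths. The function name `generalize_path`, the identifier `{PARAM}`, and the use of a set for deduplication are illustrative assumptions, not details taken from this specification.

```python
def generalize_path(path, dynamic_levels, identifier="{PARAM}"):
    """Replace the fields at the given 1-based segment levels with one identifier.

    `dynamic_levels` and `identifier` are hypothetical inputs chosen for
    illustration; the specification only says a preset unified identifier is used.
    """
    fields = [f for f in path.split("/") if f]
    generalized = [
        identifier if level in dynamic_levels else field
        for level, field in enumerate(fields, start=1)
    ]
    return "/" + "/".join(generalized)


# Two dynamic URLs of the same function collapse to one generalized path.
paths = ["/node0/node1/param0/node4", "/node0/node1/param1/node4"]
unique_paths = {generalize_path(p, {3}) for p in paths}
```

Once the parameter segment is generalized, both paths map to `/node0/node1/{PARAM}/node4`, so an ordinary set is enough to remove the duplicate.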
Refer to Fig. 1, which is a schematic diagram of an operating environment suitable for the embodiments of this specification. As shown in Fig. 1, one or more user terminals 100 can be connected through a network 200 to one or more website servers 300 (only one is shown in Fig. 1) for data communication or interaction. The website server 300 listens for web page access requests from the network 200, completes the corresponding data processing according to each request, and returns the resulting web page or data in another format to the user terminal 100. The user terminal 100 may be a personal computer (PC), laptop, tablet computer, smartphone, e-reader, in-vehicle device, network TV, wearable device, or other smart device with network capability.
In a first aspect, embodiments of this specification provide a uniform resource locator processing method. Referring to Fig. 2, the method includes steps S201 to S204. Steps S201 to S203 identify the dynamic segments, that is, the parameter segments, contained in the path portion of the target URLs, so that the growing number of URL designs that place parameter values in the path portion do not impair the accuracy of subsequent URL deduplication. Step S204 deduplicates the URL data to be processed based on the identified target dynamic segments. Each step is described in detail below.
Step S201: split the path portion of each target URL in the uniform resource locator (URL) data to be processed into one or more fields, and determine the segment level of each field and the depth of the target URL.
It should be understood that, to facilitate subsequent processing of the URL data, the URL data to be processed needs to be normalized before step S201 is executed. In the embodiments of this specification, normalization may specifically include: removing empty parameter values from the URL, removing redundant "/" characters from the URL, and converting the letters in the entire URL to uppercase or lowercase.
In the embodiments of this specification, the segment level is the order in which each field appears in the target URL, and the depth is the number of fields in the target URL. For example, suppose the target URL is HTTP://www.example.com/a/b/c?pa=1&pb=2&pc=3. After normalization, assuming the letters of the entire URL are converted to lowercase, it becomes http://www.example.com/a/b/c?pa=1&pb=2&pc=3, where the domain name is www.example.com and the path portion is /a/b/c. Splitting the path portion on the slash "/" yields three fields, a, b, and c, so the depth of the target URL is 3. Field a appears first, field b second, and field c third; accordingly, the segment level of field a is 1, that of field b is 2, and that of field c is 3.
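The splitting in step S201 can be sketched as follows, using the example URL from the text. The function name `split_path` and the returned dictionary layout are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def split_path(url):
    """Split the path portion on "/" and return ({segment level: field}, depth)."""
    path = urlsplit(url).path
    # Ignore empty strings produced by the leading slash.
    fields = [field for field in path.split("/") if field]
    levels = {level: field for level, field in enumerate(fields, start=1)}
    return levels, len(fields)


levels, depth = split_path("http://www.example.com/a/b/c?pa=1&pb=2&pc=3")
```

For this URL, `levels` maps 1 to `a`, 2 to `b`, and 3 to `c`, and `depth` is 3, matching the worked example above.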
In one embodiment of this specification, the method may further include a pre-screening step before step S201. This step screens out, from the URL data to be processed, the URLs that are more likely to be dynamic URLs and uses them as the target URLs for the subsequent generalization steps, which improves generalization efficiency. Specifically, the pre-screening step may include: obtaining the URL data to be processed from website access data; obtaining the domain name of each URL to be processed and counting the number of URLs that appear under each domain name; and selecting the domain names whose URL count is greater than a preset quantity threshold as target domain names, and using the URLs to be processed under the target domain names as the target URLs. The preset quantity threshold is set according to the actual application scenario; for example, it may be set according to factors such as the number of URLs to be processed that were actually obtained, the number of domain names of the web system, and the distribution of the number of paths under each domain name.
It should be understood that the number of URLs appearing under a domain name is the number of distinct paths under that domain name. The larger this number, the more likely it is that dynamic URLs exist under the domain name. Screening out the URLs under such domain names for subsequent generalization therefore helps improve generalization efficiency.
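A minimal sketch of the pre-screening step follows. It counts distinct paths per domain name and keeps the URLs under domain names whose count exceeds the threshold. The function name `prescreen` and the parameter `min_paths` (standing in for the preset quantity threshold) are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def prescreen(urls, min_paths):
    """Keep URLs whose domain name has more than `min_paths` distinct paths."""
    paths_by_domain = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        paths_by_domain[parts.netloc].add(parts.path)
    return [
        url for url in urls
        if len(paths_by_domain[urlsplit(url).netloc]) > min_paths
    ]


urls = [
    "http://a.example.com/x",
    "http://a.example.com/y",
    "http://a.example.com/z",
    "http://b.example.com/x",
    "http://b.example.com/x?q=1",
]
targets = prescreen(urls, min_paths=2)
```

Here `a.example.com` has three distinct paths and passes the threshold, while `b.example.com` has only one path (the query string does not create a new path) and is filtered out.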
Specifically, the URL data to be processed can be obtained mainly in two ways: first, by using a web crawler to obtain the URL data on the website server; second, by obtaining the URL data to be processed from the web logs of web page accesses.
Of course, in other embodiments of this specification, the acquired URL data to be processed may also be used directly as the target URLs.
In addition, as an optional implementation: when the function represented by a static path field has very few visits, the field is likely to be affected by other dynamic fields at the same segment level and be generalized by mistake. To correct this, a monitoring mechanism may be introduced. The monitoring mechanism may be executed before the step, in the above pre-screening, of obtaining the domain name of each URL to be processed and counting the number of URLs under each domain name. Specifically, it includes: matching each URL to be processed against a preset whitelist that includes multiple URLs; and, for the URLs to be processed that fail to match, executing the step of obtaining the domain name of each URL to be processed and counting the number of URLs under each domain name.
The URLs in the whitelist can be obtained by monitoring each URL in the website access data. The vast majority of dynamic URLs do not appear again after being accessed within the most recent 1 to 2 days. Therefore, the daily occurrence of each URL can be monitored: if a URL appears on more than T days within Q days, it is added to the whitelist and is no longer subjected to subsequent generalization. Q and T are both set according to the actual application scenario; for example, Q may be set to 5 and T to 2.
Further, to keep the whitelist applicable, it may also be updated dynamically in real time. In this case, the method further includes: monitoring the website access data, obtaining the URLs whose number of days of occurrence within a preset time period exceeds a preset number of days, and adding the obtained URLs to the whitelist.
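The whitelist rule (a URL appearing on more than T days within Q days) can be sketched as below. The input format, one set of URLs per day with the most recent day last, and the function name `build_whitelist` are assumptions for illustration; the defaults follow the example values Q = 5 and T = 2.

```python
from collections import defaultdict


def build_whitelist(daily_urls, q_days=5, t_days=2):
    """Return URLs seen on more than `t_days` of the last `q_days` days.

    `daily_urls` is a hypothetical list with one collection of URLs per day.
    """
    days_seen = defaultdict(int)
    for day in daily_urls[-q_days:]:
        for url in set(day):  # count each URL at most once per day
            days_seen[url] += 1
    return {url for url, days in days_seen.items() if days > t_days}


daily_urls = [
    {"/static/help", "/order/1001"},
    {"/static/help"},
    {"/static/help", "/static/about"},
    {"/static/about"},
    {"/static/about"},
]
whitelist = build_whitelist(daily_urls)
```

Rerunning the function on a sliding window of recent days is one simple way to approximate the real-time update described above.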
Step S202: obtain, for target URLs with identical preset features, the field content corresponding to each segment level.
The preset features include the request method, the domain name, and the depth of the URL. The request method is the method applied to the resource in the request that the user terminal sends to the website server. Taking HTTP requests as an example, the request methods include OPTIONS, HEAD, GET, POST, PUT, DELETE, TRACE, and CONNECT, of which GET and POST are the most commonly used in practice.
Step S203: count, among the target URLs with identical preset features, the number of distinct field contents corresponding to the same segment level, and, when the statistical result satisfies a preset condition, determine the target dynamic segment of the path portion of the corresponding target URLs based on the fields corresponding to that segment level.
Among all target URLs in the URL data to be processed, the target URLs having the same request method, the same domain name, and the same depth are found. Suppose there are m target URLs corresponding to a certain request method, a certain domain name, and a certain depth; then each segment level corresponds to m field contents, namely the field content of each of these m target URLs at that segment level, where m is a positive integer. Step S203 is executed for the field contents corresponding to each segment level in the target URLs having the same request method, the same domain name, and the same depth.
Suppose that the m field contents corresponding to a certain segment level of these m target URLs consist of m1 occurrences of field A, m2 occurrences of field B, and m3 occurrences of field C, where fields A, B, and C are different field contents and m1 + m2 + m3 = m. Then the number of distinct field contents corresponding to this segment level is 3; that is, the statistical result for this segment level is 3.
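The grouping and counting described above can be sketched as follows. Requests are grouped by (request method, domain name, depth), and the distinct field contents are counted per segment level. The input format, a sequence of `(method, url)` pairs, and the function name are illustrative assumptions.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def count_distinct_fields(requests):
    """Return {(method, domain, depth): {segment level: distinct field count}}."""
    groups = defaultdict(lambda: defaultdict(set))
    for method, url in requests:
        parts = urlsplit(url)
        fields = [field for field in parts.path.split("/") if field]
        key = (method, parts.netloc, len(fields))
        for level, field in enumerate(fields, start=1):
            groups[key][level].add(field)
    return {
        key: {level: len(contents) for level, contents in levels.items()}
        for key, levels in groups.items()
    }


requests = [
    ("GET", "http://www.example.com/node0/node1/p0/node4"),
    ("GET", "http://www.example.com/node0/node1/p1/node4"),
    ("GET", "http://www.example.com/node0/node2/node3/node5"),
    ("GET", "http://www.example.com/node0/node2/node3/node6"),
    ("POST", "http://www.example.com/node0/node1"),
]
counts = count_distinct_fields(requests)
```

Using a set per segment level means repeated occurrences of the same field content (the m1, m2, m3 above) contribute only once to the statistical result.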
In the embodiments of this specification, the process of determining, when the statistical result satisfies the preset condition, the target dynamic segment of the path portion of the corresponding target URLs based on the fields corresponding to the segment level specifically includes: determining whether the statistical result is greater than a first preset threshold, and if so, determining the target dynamic segment of the path portion of the corresponding target URLs based on the fields corresponding to that segment level. Of course, in other embodiments of this specification, other preset conditions may be set as needed.
Specifically, in the above process, the number of distinct field contents corresponding to the same segment level needs to be counted for all target URLs with the same request method, the same domain name, and the same depth. For example, suppose there are 10000 target URLs in total, which form 500 groups; the target URLs in the same group have the same request method, domain name, and depth, while target URLs in different groups differ in one or more of the request method, domain name, and depth. Note that the number of distinct field contents must be counted for each segment level of each group of target URLs. Then, for each group, the number of distinct field contents at each segment level is compared with the first preset threshold; if the number at some segment level in a group is greater than the first preset threshold, the target dynamic segment contained in the path portion of that group of target URLs is determined based on the fields corresponding to that segment level.
Suppose the list of target URLs in one group with the same request method, the same domain name, and the same depth is as follows:
HTTP://www.example.com/Node0/Node1/param0/Node4
HTTP://www.example.com/Node0/Node1/param1/Node4
HTTP://www.example.com/Node0/Node1/.../Node4
HTTP://www.example.com/Node0/Node1/paramN/Node4
HTTP://www.example.com/Node0/Node2/Node3/Node5
HTTP://www.example.com/Node0/Node2/Node3/Node6
The path portions of these target URLs can be viewed approximately as a tree, whose structure and descriptive features are shown in Fig. 3. The depth of these target URLs is 4. There is 1 node at segment level 1, that is, the number of distinct field contents at this segment level is 1. There are 2 nodes at segment level 2, that is, the number of distinct field contents at this segment level is 2, namely Node1 and Node2. There are N+2 nodes at segment level 3, that is, the number of distinct field contents at this segment level is N+2, namely param0, param1, ..., paramN, and Node3, where N is a positive integer greater than 1. There are 3 nodes at segment level 4, that is, the number of distinct field contents at this segment level is 3, namely Node4, Node5, and Node6.
Specifically, the first preset threshold can be set according to the actual application scenario. For example, if the first preset threshold is set to 100 and N+2 is greater than 100, the target dynamic segment contained in the path portion of this group of target URLs is determined based on the fields corresponding to segment level 3. The fields corresponding to segment level 3 include: 2 occurrences of param0, 1 of param1, ..., 1 of paramN, and 200 of Node3.
As one optional way, determining the target dynamic segment of the path portion of the corresponding target URLs based on the fields corresponding to the segment level may specifically include: in the target URLs for which the statistical result of a segment level is greater than the first preset threshold, taking all fields corresponding to that segment level as the target dynamic segments contained in the path portion of the corresponding target URLs. That is, in the above example, all fields corresponding to segment level 3 are taken as the target dynamic segments contained in the path portion of this group of target URLs.
Of course, to further reduce errors in identifying the target dynamic segments contained in the path portion and to improve the accuracy of subsequent deduplication, as another optional way, determining the target dynamic segment of the path portion of the corresponding target URLs based on the fields corresponding to the segment level may specifically include: marking the fields corresponding to that segment level as reference dynamic segments; and obtaining the segment request count of each reference dynamic segment, and determining the reference dynamic segments whose segment request count satisfies a preset request count condition as the target dynamic segments. In the embodiments of this specification, the segment request count is the number of times a field appears at the same segment level of the target URLs with identical preset features, that is, with the same request method, the same domain name, and the same depth. In other words, the segment request count of a reference dynamic segment is the number of times that reference dynamic segment appears at the same segment level of the corresponding target URLs with the same request method, domain name, and depth.
For example, as shown in Fig. 3, the segment request count (that is, the number of accesses) of field Node0 is 1000, that of field Node1 is 800, and that of field Node2 is 200; the segment request counts of fields param0 to paramN are each between 1 and 5 and sum to 800; the segment request count of field Node3 is 200, that of field Node4 is 800, that of field Node5 is 150, and that of field Node6 is 50.
Specifically, the preset request count condition can be set according to the actual application scenario. For example, a request count threshold may be set in advance according to the actual scenario; if the segment request count of a reference dynamic segment is less than this threshold, the preset request count condition is satisfied.
Alternatively, in the embodiments of this specification, the process of determining the reference dynamic segments whose segment request count satisfies the preset request count condition as the target dynamic segments may specifically include: obtaining the maximum segment request count of the target URL in which the reference dynamic segment is located; and determining the reference dynamic segments whose segment request count accounts for a proportion of the maximum segment request count that is less than or equal to a second preset threshold as the target dynamic segments.
Specifically, obtaining the maximum segment request count of the target URL in which the reference dynamic segment is located includes: determining whether every field in that target URL is marked as a reference dynamic segment; if so, taking the number of target URLs with the same domain name, the same request method, and the same depth as that target URL as its maximum segment request count; if not, taking the largest value among the segment request counts of the fields of that target URL as its maximum segment request count.
It should be noted that, in the embodiments of this specification, once the reference dynamic segments have been marked, the above method can be used directly to obtain the maximum segment request count of every target URL that contains a marked reference dynamic segment, so that the maximum segment request count of the corresponding target URL can be retrieved later as needed.
Specifically, in the above process of determining the target dynamic segments, each field of each target URL can be traversed to determine whether it is a reference dynamic segment. If so, the proportion of the segment request count of the reference dynamic segment in the maximum segment request count of the target URL in which it is located is calculated and compared with the second preset threshold; if the proportion is less than or equal to the second preset threshold, the reference dynamic segment is determined to be a target dynamic segment. The second preset threshold can be set according to the actual application scenario; specifically, it can be set according to the number of distinct fields at the segment level whose statistical result satisfies the preset condition among the target URLs with the same request method, domain name, and depth. For example, in the example shown in Fig. 3, the second preset threshold may be set to 1%.
It should be understood that the reference dynamic segments may include static fields that were marked by mistake under the influence of other dynamic fields at the same level. Among target URLs with the same request method, domain name, and depth, the sum of the segment request counts at each segment level is constant; as shown in Fig. 3, the segment request counts of the nodes at each segment level sum to 1000. Therefore, among the fields marked as reference dynamic segments at the same segment level of such target URLs, the segment request count of a dynamic field is far lower than that of a static field, and accordingly the proportion of a dynamic field's segment request count in the corresponding maximum segment request count is also far smaller than that of a static field.
Thus, to further reduce errors in identifying the target dynamic segments contained in the path portion, and to prevent the number of accesses to the target URLs from affecting the result of determining the target dynamic fields, the embodiments of this specification set a proportion threshold, namely the above second preset threshold, and compare the proportion of each reference dynamic segment's segment request count in the corresponding maximum segment request count with it, thereby further distinguishing the dynamic fields from the static fields among the reference dynamic segments. A reference dynamic segment whose proportion is less than or equal to the second preset threshold is confirmed as a target dynamic segment, that is, a parameter segment of the path portion; a reference dynamic segment whose proportion is greater than the second preset threshold is confirmed as a static field, that is, a fixed path segment of the path portion.
For example, in the example shown in Fig. 3, only the fields corresponding to segment level 3, namely param0, param1, ..., paramN, and Node3, are marked as reference dynamic segments. The segment request count of field param0 is 2, that of param1 is 1, ..., that of paramN is 1, and that of Node3 is 200. Clearly, the maximum segment request count of the target URLs in which param0, param1, ..., paramN, and Node3 are located is the segment request count of field Node0, that is, 1000. If the second preset threshold is set to 1%, then param0, param1, ..., paramN are confirmed as target dynamic segments, whereas the reference dynamic segment Node3, whose proportion is 200/1000 = 20%, is a static field.
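The two-threshold procedure can be sketched end to end for a single group of target URLs. The sketch assumes one list of fields per request, and it takes "the number of target URLs" in the all-fields-marked case to mean the number of requests in the group; the text leaves that point open. All names are hypothetical.

```python
from collections import Counter


def find_target_dynamic(paths, first_threshold, second_threshold):
    """Return the (segment level, field) pairs judged to be target dynamic segments.

    `paths` holds one field list per request, all from one group
    (same request method, domain name, and depth).
    """
    depth = len(paths[0])
    # Segment request count: occurrences of each field at each segment level.
    seg_count = Counter(
        (level, field) for p in paths for level, field in enumerate(p, start=1)
    )
    # Number of distinct field contents per segment level.
    distinct = Counter(level for level, _field in seg_count)
    ref_levels = {lv for lv in range(1, depth + 1) if distinct[lv] > first_threshold}
    targets = set()
    for p in set(map(tuple, paths)):
        if len(ref_levels) == depth:
            max_count = len(paths)  # assumption: size of the whole group
        else:
            max_count = max(seg_count[(lv, f)] for lv, f in enumerate(p, start=1))
        for lv in ref_levels:
            field = p[lv - 1]
            if seg_count[(lv, field)] / max_count <= second_threshold:
                targets.add((lv, field))
    return targets


# Synthetic log shaped like Fig. 3: 1000 requests in total.
paths = [["Node0", "Node1", "param0", "Node4"]] * 2
paths += [["Node0", "Node1", f"param{i}", "Node4"] for i in range(1, 799)]
paths += [["Node0", "Node2", "Node3", "Node5"]] * 150
paths += [["Node0", "Node2", "Node3", "Node6"]] * 50
result = find_target_dynamic(paths, first_threshold=100, second_threshold=0.01)
```

With these numbers, only segment level 3 exceeds the first threshold; each param field has a proportion of at most 2/1000, so it is kept as a target dynamic segment, while Node3 at 200/1000 is rejected as a static field.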
In addition, it should be noted that, in practice, analysis of how the path portion of a URL is routed shows that the routing logic of most systems is as follows: fixed strings are looked up first; if no fixed string matches, the parameters registered in the routing table are matched by value type, with numeric parameters matched first and string parameters matched if no integer parameter is registered. Therefore, the segments of a path portion that contains parameter segments can be divided into three classes: fixed path segments, numeric parameter segments, and string parameter segments.
Numeric parameter segments have obvious features, are relatively easy to identify, and can be generalized in advance. In addition, strongly featured string types within the string parameter segments, such as dates, can also be generalized in advance. Therefore, to improve the efficiency of the generalization process, as an optional implementation, the embodiments of this specification may further include a pre-generalization step before step S202: the fields in the path portion of the target URLs that are relatively easy to identify are preliminarily generalized, and step S202 is then executed on the path portions processed by the pre-generalization step to continue identifying the string parameter segments that are harder to distinguish. The target dynamic segments identified in this case are string parameter segments. Specifically, the pre-generalization step includes the following step S401 and/or step S402.
Step S401: determine whether each field of each target URL matches a preset numeric feature; if so, generalize the field to a preset first identifier.
It should be noted that matching against the preset numeric feature identifies fields in which every character is a digit. That is, if a field matches the preset numeric feature, all characters of the field are digits and the field is a numeric parameter segment, for example "123". The field is then replaced with the preset first identifier. The first identifier can be set as needed, for example to {NUM} or to another uniform identifier.
Step S402: determine whether each field of each target URL matches a preset date feature; if so, generalize the field to a preset second identifier.
Specifically, the preset date feature may include multiple preset time format rules, for example 20XX-XX-XX. Each field of each target URL is matched against the preset time format rules; a successful match indicates that the field is a date parameter segment, for example "2018-08-14". The field is then replaced with the preset second identifier. The second identifier can be set as needed, for example to {DT} or to another uniform identifier. It should be noted that the second identifier may be set to differ from the first identifier or to be the same as the first identifier.
It should be noted that the pre-generalization step may also include steps that generalize in advance other strongly featured string types, besides dates, that occur among string parameter segments. When the pre-generalization step includes both step S401 and step S402, the steps may be performed in the order shown in Fig. 4 or in a different order; for example, step S402 may be performed before step S401, or substantially simultaneously with step S401. Of course, in other embodiments of this specification, the pre-generalization step may be omitted, and the target dynamic segments, that is, the parameter segments, of the path portion of the target URLs are identified directly by performing step S202 and step S203.
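Steps S401 and S402 can be sketched as follows. This is only an illustration under stated assumptions: the regular expressions stand in for the "preset numeric feature" and the single 20XX-XX-XX time format rule, and the function name `pre_generalize` and the sample path are hypothetical. The identifiers {NUM} and {DT} are the examples given above.

```python
import re

# Assumed stand-ins for the preset features described in S401 and S402.
NUMERIC = re.compile(r"\d+")               # every character is a digit
DATE = re.compile(r"20\d{2}-\d{2}-\d{2}")  # the 20XX-XX-XX format rule


def pre_generalize(path):
    """Replace numeric and date fields of a URL path with uniform identifiers."""
    fields = [f for f in path.split("/") if f]
    result = []
    for field in fields:
        if NUMERIC.fullmatch(field):
            result.append("{NUM}")   # S401: first identifier
        elif DATE.fullmatch(field):
            result.append("{DT}")    # S402: second identifier
        else:
            result.append(field)     # left for the later dynamic-segment analysis
    return "/" + "/".join(result)


generalized = pre_generalize("/order/123/2018-08-14/detail")
```

Because the two patterns cannot match the same field, the order in which S401 and S402 are checked does not affect the result in this sketch.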
Step S204: perform deduplication on the URL data to be processed according to the target dynamic segment.
Specifically, performing deduplication on the URL data to be processed according to the target dynamic segment may include: generalizing the target dynamic segment to a preset third identifier; obtaining the parameter names corresponding to each URL in the URL data to be processed, and sorting the parameter names corresponding to each URL; and removing duplicate uniform resource locators from the URL data to be processed based on the domain name, request method, path portion and sorted parameter names of each URL after the generalization.
The third identifier can be set as needed, for example to {STR} or to another uniform identifier. It should be noted that, in the embodiments of this specification, the third identifier may be set to be the same as the first identifier and/or the second identifier, or to differ from them.
For example, for the target URLs of the example shown in Fig. 3, after the target dynamic segments identified in step S202 are generalized, the resulting list of target URLs is:
HTTP://www.example.com/Node0/Node1/{STR}/Node4
HTTP://www.example.com/Node0/Node2/Node3/Node5
HTTP://www.example.com/Node0/Node2/Node3/Node6
It should be noted that, in this step, all of the URL data to be processed must be deduplicated, including both the target URLs that contain target dynamic segments and the other URLs that do not. Therefore, the parameter names corresponding to each URL in the URL data to be processed must be obtained and sorted. Specifically, the parameter names corresponding to a URL include the parameter names contained in the parameter portion of the URL. In addition, for requests whose request body also contains a parameter portion, such as POST requests, the parameter names corresponding to the URL further include the parameter names in the request body. Accordingly, when the parameter names corresponding to each URL are sorted, the parameter names contained in the parameter portion of the URL and the parameter names in the POST body are sorted separately. The names may be sorted in descending alphabetical order or in ascending alphabetical order.
For example, suppose a URL is http://www.example.com/a/b/c?pa=1&pb=2&pc=3. The parameter names contained in the parameter portion of the URL are pa, pb and pc. If the URL belongs to a request whose request body contains no parameter portion, sorting the parameter names of the URL in ascending order gives the combination papbpc. If the URL belongs to a request whose request body contains a parameter portion, and the parameter names in the body are pd and pe, sorting the parameter names of the URL in ascending order gives the combinations papbpc and pdpe, where papbpc is the combination of the sorted parameter names contained in the parameter portion of the URL, and pdpe is the combination of the sorted parameter names contained in the parameter portion of the corresponding request body.
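The sorting in this example can be sketched with the Python standard library. The function name `sorted_param_names` is hypothetical, and the sketch assumes that the request body, when present, is form-encoded in the same `name=value&...` style as the query string; other body encodings would need a different parser.

```python
from urllib.parse import urlsplit, parse_qsl


def sorted_param_names(url, body=""):
    """Return the ascending-sorted parameter-name combinations for the URL and the body."""
    # Parameter names from the parameter portion (query string) of the URL.
    query_names = sorted(
        name for name, _ in parse_qsl(urlsplit(url).query, keep_blank_values=True)
    )
    # Parameter names from the request body, sorted separately (assumed form-encoded).
    body_names = sorted(name for name, _ in parse_qsl(body, keep_blank_values=True))
    return "".join(query_names), "".join(body_names)


get_names = sorted_param_names("http://www.example.com/a/b/c?pc=3&pa=1&pb=2")
post_names = sorted_param_names(
    "http://www.example.com/a/b/c?pa=1&pb=2&pc=3", body="pe=5&pd=4"
)
```

Sorting makes the combination independent of the order in which a client happens to send the parameters, so two requests that differ only in parameter order yield the same combination.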
It should be noted that, in the embodiments of this specification, the point at which the step of obtaining and sorting the parameter names corresponding to each URL in the URL data to be processed is performed is not limited; for example, the step may also be performed before step S201.
Specifically, after the generalization, duplicate URLs whose domain name, request method, path portion and sorted parameter names are all identical must further be removed from the URL data to be processed. That is, of multiple URLs that have the same domain name, the same request method, the same path portion and the same sorted parameter names, only one is retained. For example, if A1, A2, ..., A200 in the URL data to be processed are URLs with the same domain name, the same request method, the same path portion and the same sorted parameter names, only one of A1, A2, ..., A200 is retained. It should be noted that, in the URL data to be processed, for a target URL that contains a target dynamic segment, the path portion used in deduplication is the generalized path portion, whereas for other URLs that have not been generalized, the path portion used in deduplication is the original path portion after normalization.
As an optional approach, the process of removing duplicate URLs from the URL data to be processed based on the domain name, request method, path portion and sorted parameter names of each URL after the generalization may include: generating a corresponding feature string based on the domain name, request method, path portion and sorted parameter names of each URL after the generalization; and filtering the URL data to be processed based on the feature strings, retaining one URL for each feature string.
It should be understood that, in the URL data to be processed, for a target URL that contains a target dynamic segment, the path portion used to generate the corresponding feature string is the generalized path portion, whereas for other URLs that have not been generalized, the path portion used to generate the corresponding feature string is the original path portion after normalization. For example, for the generalized target URL HTTP://www.example.com/Node0/Node1/{STR}/Node4 in the above example, the path portion used to generate the corresponding feature string is /Node0/Node1/{STR}/Node4.
In the embodiments of this specification, the domain name, request method, path portion and sorted parameter names of each URL after the generalization may be combined in a preconfigured manner to obtain the corresponding feature string. For example, for each URL, the domain name, request method, path and sorted parameter names may simply be concatenated in sequence to form the feature string corresponding to the URL; the concatenation order may be configured in advance as needed.
In this case, the specific process of filtering the URL data to be processed based on the feature strings and retaining one URL for each feature string may be: computing the hash value of the feature string corresponding to each URL with a preset hash function, and deduplicating by hash value to obtain the deduplicated URLs. That is, if multiple URLs correspond to the same hash value, only one of them is retained.
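The feature-string approach can be sketched as follows. Several details are assumptions made for illustration only: SHA-256 stands in for the unspecified "preset hash function", the "|" separator and the field order are one possible concatenation configuration, and the record layout (dictionaries with `domain`, `method`, `path` and `params` keys) is hypothetical. The paths are assumed to have been generalized already.

```python
import hashlib


def feature_string(domain, method, path, param_names):
    # Concatenation order and the "|" separator are illustrative assumptions.
    return "|".join([domain, method, path, param_names])


def dedupe(records):
    """Keep the first record seen for each distinct feature-string hash."""
    seen, kept = set(), []
    for rec in records:
        text = feature_string(rec["domain"], rec["method"], rec["path"], rec["params"])
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(rec)
    return kept


records = [
    {"domain": "www.example.com", "method": "GET",
     "path": "/Node0/Node1/{STR}/Node4", "params": "papb"},
    {"domain": "www.example.com", "method": "GET",
     "path": "/Node0/Node1/{STR}/Node4", "params": "papb"},
    {"domain": "www.example.com", "method": "GET",
     "path": "/Node0/Node2/Node3/Node5", "params": ""},
]
unique = dedupe(records)
```

A separator that cannot occur in the individual parts helps keep two different combinations from concatenating to the same string.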
As another optional approach, duplicate URLs may also be removed with a unique-key mechanism, based on the domain name, request method, path portion and sorted parameter names.
In addition, it should also be noted that, in other embodiments of this specification, the deduplication process need not replace the identified target dynamic segments with the preset third identifier, and may instead directly mark the identified target dynamic segments. In this way, for multiple URLs whose domain name, request method, depth and sorted parameter names are all the same and which are marked as containing target dynamic segments at the same segment level, if the other fields of the path portion, apart from the target dynamic segments, are also all the same, the URLs are considered duplicates and only one of them is retained.
In the uniform resource locator processing method provided by the embodiments of this specification, the number of distinct field contents corresponding to the same segment level is first counted across the target URLs that have the same request method, domain name and depth; when the count satisfies a preset condition, the target dynamic segments of the path portion of the corresponding target URLs are determined based on the fields corresponding to that segment level; deduplication of the URL data to be processed is then performed based on the identified target dynamic segments. This helps avoid the effect that target dynamic segments contained in the path portion would otherwise have on the subsequent deduplication, thereby improving the accuracy of URL deduplication.
In a second aspect, based on the same inventive concept, the embodiments of this specification further provide a uniform resource locator processing apparatus. Referring to Fig. 5, the apparatus includes:
a splitting module 51, configured to divide the path portion of each target URL in uniform resource locator (URL) data to be processed into one or more fields, and to determine the segment level of each field and the depth of the target URL, wherein the segment level is the order in which the field appears in the target URL, and the depth is the number of fields of the target URL;
an obtaining module 52, configured to obtain the field content corresponding to each segment level in target URLs having the same preset features, wherein the preset features include the request method, the domain name and the depth of the target URL;
a determining module 53, configured to count the number of distinct field contents corresponding to the same segment level in the target URLs having the same preset features, and, when the count satisfies a preset condition, to determine the target dynamic segment of the path portion of the corresponding target URLs based on the field corresponding to that segment level;
a deduplication module 54, configured to perform deduplication on the URL data to be processed according to the target dynamic segment.
As an optional embodiment, the determining module 53 includes:
a judging submodule 531, configured to determine whether the count is greater than a first preset threshold;
a determining submodule 532, configured to determine, when the judging submodule 531 determines that the count is greater than the first preset threshold, the target dynamic segment of the path portion of the corresponding target URLs based on the field corresponding to that segment level.
As an optional embodiment, the determining submodule 532 is specifically configured to:
mark the field corresponding to that segment level as a reference dynamic segment;
obtain the segment request count of the reference dynamic segment, and determine a reference dynamic segment whose segment request count satisfies a preset request count condition as the target dynamic segment, wherein the segment request count is the number of times a field occurs at the same segment level of the target URLs having the same preset features.
As an optional embodiment, the determining submodule 532 is specifically configured to:
obtain the maximum segment request count of the target URL in which the reference dynamic segment is located;
determine a reference dynamic segment whose segment request count accounts for a proportion of the maximum segment request count that is less than or equal to a second preset threshold as the target dynamic segment.
As an optional embodiment, the determining submodule 532 is specifically configured to:
determine whether every field in the target URL in which the reference dynamic segment is located is marked as a reference dynamic segment;
if so, use the number of target URLs that have the same domain name, the same request method and the same depth as that target URL as the maximum segment request count;
if not, use the largest value among the segment request counts of the fields of that target URL as the maximum segment request count.
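The two branches above for obtaining the maximum segment request count can be sketched as follows. The function name, the representation of a field as a `(segment level, content)` pair and all argument layouts are hypothetical; segment levels are numbered from 1, consistent with param0 sitting at segment level 3 in the Fig. 3 example. Apart from the counts for Node0 (1000) and param0 (2), the numbers are invented for illustration.

```python
def max_segment_request_count(url_fields, reference_dynamic, segment_counts, group_url_count):
    """Return the maximum segment request count for one target URL.

    url_fields: field contents of the URL path, in order of appearance
    reference_dynamic: set of (level, content) pairs marked as reference dynamic segments
    segment_counts: (level, content) -> segment request count
    group_url_count: number of target URLs sharing domain name, request method and depth
    """
    keys = list(enumerate(url_fields, start=1))
    if all(key in reference_dynamic for key in keys):
        # Every field is a reference dynamic segment: fall back to the group size.
        return group_url_count
    # Otherwise take the largest segment request count among the URL's fields.
    return max(segment_counts[key] for key in keys)


counts = {(1, "Node0"): 1000, (2, "Node1"): 3, (3, "param0"): 2, (4, "Node4"): 3}
mixed = max_segment_request_count(
    ["Node0", "Node1", "param0", "Node4"], {(3, "param0")}, counts, 1000
)
all_dynamic = max_segment_request_count(
    ["a", "b"], {(1, "a"), (2, "b")}, {(1, "a"): 1, (2, "b"): 1}, 50
)
```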
As an optional embodiment, the uniform resource locator processing apparatus provided by the embodiments of this specification further includes:
a first generalization module, configured to determine whether each field of each target URL matches a preset numeric feature, and if so, to generalize the field to a preset first identifier.
As an optional embodiment, the uniform resource locator processing apparatus provided by the embodiments of this specification further includes:
a second generalization module, configured to determine whether each field of each target URL matches a preset date feature, and if so, to generalize the field to a preset second identifier.
As an optional embodiment, the uniform resource locator processing apparatus provided by the embodiments of this specification further includes a screening module, configured to: obtain URL data to be processed from website access data; obtain the domain name of each URL to be processed and count the number of URLs that occur under each domain name; and select the domain names under which the number of URLs is greater than a preset quantity threshold as target domain names, and use the URLs to be processed under the target domain names as the target URLs.
As an optional embodiment, the screening module is further configured to: match each URL to be processed against a preset whitelist, the whitelist including multiple URLs; and, for the URLs to be processed that fail to match, perform the step of obtaining the domain name of each URL to be processed and counting the number of URLs that occur under each domain name.
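The screening described for this module can be sketched as follows. This is an illustration under assumptions: whitelist matching is modeled as exact string membership, the domain name is taken from the network-location part of the URL, and the function name `select_target_urls` and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def select_target_urls(urls, whitelist, min_count):
    """Select target URLs: skip whitelisted URLs, then keep busy domain names only."""
    # URLs that match the whitelist are not analyzed further (assumed exact match).
    candidates = [u for u in urls if u not in whitelist]
    # Count how many candidate URLs occur under each domain name.
    per_domain = Counter(urlsplit(u).netloc for u in candidates)
    # Keep URLs whose domain name has more URLs than the preset quantity threshold.
    return [u for u in candidates if per_domain[urlsplit(u).netloc] > min_count]


urls = [
    "http://a.example.com/1",
    "http://a.example.com/2",
    "http://a.example.com/3",
    "http://b.example.com/1",
    "http://a.example.com/ok",
]
targets = select_target_urls(urls, {"http://a.example.com/ok"}, 2)
```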
As an optional embodiment, the uniform resource locator processing apparatus provided by the embodiments of this specification further includes a monitoring module, configured to monitor the website access data, obtain the URLs that occur on more than a preset number of days within a preset time period, and add the obtained URLs to the whitelist.
As an optional embodiment, the deduplication module 54 includes:
a generalization submodule 541, configured to generalize the target dynamic segment to a preset third identifier;
an obtaining submodule 542, configured to obtain the parameter names corresponding to each URL in the URL data to be processed, and to sort the parameter names corresponding to each URL;
a deduplication submodule 543, configured to remove duplicate URLs from the URL data to be processed based on the domain name, request method, path portion and sorted parameter names of each URL after the generalization.
As an optional embodiment, the deduplication submodule 543 is specifically configured to:
generate a corresponding feature string based on the domain name, request method, path portion and sorted parameter names of each URL after the generalization;
filter the URL data to be processed based on the feature strings, retaining one URL for each feature string.
It should be noted that, for the uniform resource locator processing apparatus provided by the embodiments of this specification, the specific manner in which each unit performs its operations has been described in detail in the method embodiments above and is not elaborated here.
In a third aspect, based on the same inventive concept as the uniform resource locator processing method of the foregoing embodiments, the present invention further provides a server, including a memory, one or more processors, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the uniform resource locator processing method described above when executing the program.
Fig. 6 shows a structural block diagram of a server that can be applied in the embodiments of this specification. As shown in Fig. 6, the server 600 includes a memory 601, a processor 602 and a network module 603.
The memory 601 can be used to store software programs and modules, such as the program instructions/modules corresponding to the uniform resource locator processing method and apparatus in the embodiments of this specification. By running the software programs and modules stored in the memory 601, the processor 602 executes various functional applications and data processing, that is, implements the uniform resource locator processing method of the embodiments of the present invention. The memory 601 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory or other non-volatile solid-state memory. Further, the software programs and modules in the memory 601 may also include an operating system 621 and a service module 622. The operating system 621 may be, for example, LINUX, UNIX or WINDOWS, and may include various software components and/or drivers for managing system tasks (such as memory management, storage device control and power management), and can communicate with various hardware or software components so as to provide a running environment for other software components. The service module 622 runs on top of the operating system 621, listens for requests from the network through the network services of the operating system 621, completes the corresponding data processing according to each request, and returns the processing result to the user terminal. That is, the service module 622 is used to provide network services to user terminals.
The network module 603 is used to receive and send network signals. The network signals may include wireless signals or wired signals.
The server in the embodiments of this specification may be a WEB server, a database server or the like. It can be understood that the structure shown in Fig. 6 is merely illustrative; the server 600 may include more or fewer components than shown in Fig. 6, or have a configuration different from that shown in Fig. 6. Each component shown in Fig. 6 may be implemented in hardware, software or a combination thereof. In addition, the server in the embodiments of the present invention may also comprise multiple servers with different specific functions.
In a fourth aspect, based on the same inventive concept as the uniform resource locator processing method of the foregoing embodiments, the present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of any of the uniform resource locator processing methods described above.
This specification is described with reference to flowcharts and/or block diagrams of the methods, devices (systems) and computer program products according to the embodiments of this specification. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of this specification have been described, a person skilled in the art can make further changes and modifications to these embodiments once the basic inventive concept is known. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of this specification.
Obviously, a person skilled in the art can make various modifications and variations to this specification without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of this specification and their technical equivalents, this specification is also intended to include them.

Claims (26)

1. A uniform resource locator processing method, comprising:
dividing the path portion of each target URL in uniform resource locator (URL) data to be processed into one or more fields, and determining the segment level of each field and the depth of the target URL, wherein the segment level is the order in which the field appears in the target URL, and the depth is the number of fields of the target URL;
obtaining the field content corresponding to each segment level in target URLs having the same preset features, wherein the preset features include the request method, the domain name and the depth of the target URL;
counting the number of distinct field contents corresponding to the same segment level in the target URLs having the same preset features, and, when the count satisfies a preset condition, determining the target dynamic segment of the path portion of the corresponding target URLs based on the field corresponding to that segment level;
performing deduplication on the URL data to be processed according to the target dynamic segment.
2. The method according to claim 1, wherein, when the count satisfies the preset condition, determining the target dynamic segment of the path portion of the corresponding target URLs based on the field corresponding to that segment level comprises:
determining whether the count is greater than a first preset threshold, and if so, determining the target dynamic segment of the path portion of the corresponding target URLs based on the field corresponding to that segment level.
3. The method according to claim 1, wherein determining the target dynamic segment of the path portion of the corresponding target URLs based on the field corresponding to that segment level comprises:
marking the field corresponding to that segment level as a reference dynamic segment;
obtaining the segment request count of the reference dynamic segment, and determining a reference dynamic segment whose segment request count satisfies a preset request count condition as the target dynamic segment, wherein the segment request count is the number of times a field occurs at the same segment level of the target URLs having the same preset features.
4. The method according to claim 3, wherein determining a reference dynamic segment whose segment request count satisfies the preset request count condition as the target dynamic segment comprises:
obtaining the maximum segment request count of the target URL in which the reference dynamic segment is located;
determining a reference dynamic segment whose segment request count accounts for a proportion of the maximum segment request count that is less than or equal to a second preset threshold as the target dynamic segment.
5. The method according to claim 4, wherein obtaining the maximum segment request count of the target URL in which the reference dynamic segment is located comprises:
determining whether every field in the target URL in which the reference dynamic segment is located is marked as a reference dynamic segment;
if so, using the number of target URLs that have the same domain name, the same request method and the same depth as that target URL as the maximum segment request count;
if not, using the largest value among the segment request counts of the fields of that target URL as the maximum segment request count.
6. The method according to claim 1, further comprising, before obtaining the field content corresponding to each segment level in the target URLs having the same preset features:
determining whether each field of each target URL matches a preset numeric feature, and if so, generalizing the field to a preset first identifier.
7. The method according to claim 1, further comprising, before obtaining the field content corresponding to each segment level in the target URLs having the same preset features:
determining whether each field of each target URL matches a preset date feature, and if so, generalizing the field to a preset second identifier.
8. The method according to claim 1, further comprising, before dividing the path portion of each target URL in the uniform resource locator (URL) data to be processed into one or more fields:
obtaining URL data to be processed from website access data;
obtaining the domain name of each URL to be processed, and counting the number of URLs that occur under each domain name;
selecting the domain names under which the number of URLs is greater than a preset quantity threshold as target domain names, and using the URLs to be processed under the target domain names as the target URLs.
9. The method according to claim 8, further comprising, before obtaining the domain name of each URL to be processed and counting the number of URLs that occur under each domain name:
matching each URL to be processed against a preset whitelist, the whitelist including multiple URLs;
for the URLs to be processed that fail to match, performing the step of obtaining the domain name of each URL to be processed and counting the number of URLs that occur under each domain name.
10. The method according to claim 9, further comprising:
monitoring the website access data, obtaining the URLs that occur on more than a preset number of days within a preset time period, and adding the obtained URLs to the whitelist.
11. The method according to claim 1, wherein performing deduplication on the URL data to be processed according to the target dynamic segment comprises:
generalizing the target dynamic segment to a preset third identifier;
obtaining the parameter names corresponding to each URL in the URL data to be processed, and sorting the parameter names corresponding to each URL;
removing duplicate URLs from the URL data to be processed based on the domain name, request method, path portion and sorted parameter names of each URL after the generalization.
12. The method according to claim 11, wherein removing duplicate URLs from the URL data to be processed based on the domain name, request method, path portion and sorted parameter names of each URL after the generalization comprises:
generating a corresponding feature string based on the domain name, request method, path portion and sorted parameter names of each URL after the generalization;
filtering the URL data to be processed based on the feature strings, retaining one URL for each feature string.
13. A uniform resource locator processing apparatus, comprising:
a splitting module, configured to divide the path portion of each target URL in uniform resource locator (URL) data to be processed into one or more fields, and to determine the segment level of each field and the depth of the target URL, wherein the segment level is the order in which the field appears in the target URL, and the depth is the number of fields of the target URL;
an obtaining module, configured to obtain the field content corresponding to each segment level in target URLs having the same preset features, wherein the preset features include the request method, the domain name and the depth of the target URL;
a determining module, configured to count the number of distinct field contents corresponding to the same segment level in the target URLs having the same preset features, and, when the count satisfies a preset condition, to determine the target dynamic segment of the path portion of the corresponding target URLs based on the field corresponding to that segment level;
a deduplication module, configured to perform deduplication on the URL data to be processed according to the target dynamic segment.
14. device according to claim 13, the determining module include:
Judging submodule, for judging whether the statistical result is greater than the first preset threshold;
Submodule is determined, for when the judging submodule determines that the statistical result is greater than the first preset threshold, being based on should The corresponding field of section rank determines the target dynamic section of path sections in respective objects URL.
15. The device according to claim 13, wherein the determining submodule is specifically configured to:
Mark the field corresponding to that segment level as a reference dynamic segment;
Obtain the segment request amount of the reference dynamic segment, and determine a reference dynamic segment whose segment request amount meets a preset request amount condition as the target dynamic segment, wherein the segment request amount is the number of times each field appears at the same segment level in the target URLs that share the same preset feature.
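The segment request amount can be sketched as a per-level frequency count. The example assumes that every entry in the group represents one observed request, so repeated URLs count multiple times; that reading and the function name are assumptions.

```python
from collections import Counter


def segment_request_amounts(group):
    """Return one Counter per segment level mapping field content to its occurrence count."""
    return [Counter(level) for level in zip(*group)]
```

With the group `[["user", "1", "profile"], ["user", "2", "profile"], ["user", "1", "profile"]]`, the field `user` has a segment request amount of 3 at level 1, while `1` and `2` have amounts of 2 and 1 at level 2.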
16. The device according to claim 15, wherein the determining submodule is specifically configured to:
Obtain the maximum segment request amount of the target URL in which the reference dynamic segment is located;
Determine a reference dynamic segment for which the ratio of its segment request amount to the maximum segment request amount is less than or equal to a second preset threshold as the target dynamic segment.
17. The device according to claim 16, wherein the determining submodule is specifically configured to:
Judge whether every field in the target URL in which the reference dynamic segment is located is marked as a reference dynamic segment;
If so, take the number of target URLs under the domain name of that target URL that have the same request method and the same depth as the maximum segment request amount;
If not, take the largest value among the segment request amounts of the fields of that target URL as the maximum segment request amount.
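The ratio test of claims 16 and 17 can be sketched for a single target URL as follows. The function signature is hypothetical; the group again stands for all URLs sharing the request method, domain name, and depth, and levels are numbered from 1.

```python
from collections import Counter


def target_dynamic_levels(url_fields, group, reference_levels, second_threshold):
    """Return the 1-based levels of url_fields confirmed as target dynamic segments.

    url_fields: the fields of one target URL that belongs to the group
    group: all field lists sharing its request method, domain name, and depth
    reference_levels: set of 1-based levels already marked as reference dynamic
    """
    counts = [Counter(level) for level in zip(*group)]
    amounts = [counts[i][field] for i, field in enumerate(url_fields)]
    if len(reference_levels) == len(url_fields):
        # Every field is a reference dynamic segment: use the number of URLs in the group.
        max_amount = len(group)
    else:
        # Otherwise use the largest segment request amount among this URL's fields.
        max_amount = max(amounts)
    return [
        level for level in sorted(reference_levels)
        if amounts[level - 1] / max_amount <= second_threshold
    ]
```

The intuition is that a truly dynamic field, such as an ID, accounts for only a small share of the requests that a stable field in the same URL receives.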
18. The device according to claim 13, further comprising:
A first generalization module, configured to judge whether each field of each target URL meets a preset numeric type feature, and if so, to generalize the field to a preset first identifier.
19. The device according to claim 13, further comprising:
A second generalization module, configured to judge whether each field of each target URL meets a preset date type feature, and if so, to generalize the field to a preset second identifier.
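The numeric and date generalization of claims 18 and 19 might look like the sketch below. The claims only say the features and identifiers are "preset", so the regular expressions and the placeholders `{int}` and `{date}` are assumptions chosen for illustration.

```python
import re

FIRST_IDENTIFIER = "{int}"    # hypothetical placeholder for numeric fields
SECOND_IDENTIFIER = "{date}"  # hypothetical placeholder for date fields

# Assumed feature definitions: YYYY-MM-DD or eight digits for dates, digits only for numbers.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}|\d{8}")
NUMERIC_PATTERN = re.compile(r"\d+")


def generalize_field(field: str) -> str:
    """Replace a numeric or date-like field with its identifier; leave other fields as-is."""
    # Dates are checked first so an eight-digit date is not caught by the numeric rule.
    if DATE_PATTERN.fullmatch(field):
        return SECOND_IDENTIFIER
    if NUMERIC_PATTERN.fullmatch(field):
        return FIRST_IDENTIFIER
    return field
```

The order of the two checks is a design choice of this sketch; the claims treat the two modules independently and do not specify a precedence.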
20. The device according to claim 13, further comprising a screening module configured to:
Obtain URL data to be processed from website access data;
Obtain the domain names of the URLs to be processed, and count the number of URLs appearing under each domain name;
Select the domain names under which the number of URLs appearing is greater than a preset quantity threshold as target domain names, and take the URLs to be processed under the target domain names as the target URLs.
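A minimal sketch of the screening step is shown below. It counts every URL occurrence per domain name; whether the count should be over occurrences or over distinct URLs is not stated in the claim, so that choice and the function name are assumptions.

```python
from collections import Counter
from urllib.parse import urlsplit


def select_target_urls(urls, quantity_threshold):
    """Keep the URLs whose domain name appears more than quantity_threshold times."""
    domain_counts = Counter(urlsplit(u).hostname for u in urls)
    return [u for u in urls if domain_counts[urlsplit(u).hostname] > quantity_threshold]
```

Filtering out low-volume domain names first keeps the later per-level statistics from being computed on groups too small to be meaningful.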
21. The device according to claim 20, wherein the screening module is further configured to:
Match each URL to be processed against a preset whitelist, the whitelist including multiple URLs;
For the URLs to be processed that fail to match, perform the step of obtaining the domain names of the URLs to be processed and counting the number of URLs appearing under each domain name.
22. The device according to claim 21, further comprising:
A monitoring module, configured to monitor the website access data, obtain URLs whose number of days of appearance within a preset time period exceeds a preset number of days, and add the obtained URLs to the whitelist.
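The whitelist handling of claims 21 and 22 can be sketched as two small helpers. The access log is assumed to be an iterable of `(day, url)` pairs and whitelist matching is assumed to be exact string equality; both are assumptions, since the claims do not define the log format or the matching rule.

```python
from collections import defaultdict
from datetime import date


def stable_urls(access_log, period_start, period_end, min_days):
    """Return the URLs seen on more than min_days distinct days within the period."""
    days_seen = defaultdict(set)
    for day, url in access_log:
        if period_start <= day <= period_end:
            days_seen[url].add(day)
    return {url for url, days in days_seen.items() if len(days) > min_days}


def split_by_whitelist(urls, whitelist):
    """Separate URLs that match the whitelist from those that still need processing."""
    matched = [u for u in urls if u in whitelist]
    unmatched = [u for u in urls if u not in whitelist]
    return matched, unmatched
```

A URL that recurs on many different days is unlikely to contain a one-off dynamic value, which is presumably why such URLs can skip the dynamic-segment analysis.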
23. The device according to claim 13, wherein the deduplication module comprises:
A generalization submodule, configured to generalize the target dynamic segment to a preset third identifier;
An obtaining submodule, configured to obtain the parameter names corresponding to each URL in the URL data to be processed, and to sort the parameter names corresponding to each URL;
A duplicate removal submodule, configured to remove duplicate URLs from the URL data to be processed based on the domain name, request method, path portion, and sorted parameter names of each URL after the generalization processing.
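The generalization and parameter-sorting submodules could be sketched as follows. The placeholder `{dyn}` stands in for the "preset third identifier", and the set of dynamic levels is assumed to have been determined beforehand; both names are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit

THIRD_IDENTIFIER = "{dyn}"  # hypothetical placeholder for target dynamic segments


def generalize_path(url, dynamic_levels):
    """Replace the fields at the given 1-based segment levels with the identifier."""
    fields = [f for f in urlsplit(url).path.split("/") if f]
    fields = [
        THIRD_IDENTIFIER if level in dynamic_levels else field
        for level, field in enumerate(fields, start=1)
    ]
    return "/" + "/".join(fields)


def sorted_param_names(url):
    """Return the distinct query parameter names of the URL in sorted order."""
    pairs = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    return sorted({name for name, _ in pairs})
```

Sorting the parameter names makes `?tab=a&lang=en` and `?lang=en&tab=b` compare as equal, since only the names and not the values or their order take part in the comparison.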
24. The device according to claim 23, wherein the duplicate removal submodule is specifically configured to:
Generate a corresponding feature string based on the domain name, request method, path portion, and sorted parameter names of each URL after the generalization processing;
Filter the URL data to be processed based on the feature strings, and retain one URL for each feature string.
25. A server, comprising:
A memory;
One or more processors; and
The uniform resource locator processing device according to any one of claims 13-24, stored in the memory and configured to be executed by the one or more processors.
26. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1-12.
CN201811014600.8A 2018-08-31 2018-08-31 Uniform resource locator processing method, device, server and readable storage medium Active CN109359250B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811014600.8A CN109359250B (en) 2018-08-31 2018-08-31 Uniform resource locator processing method, device, server and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811014600.8A CN109359250B (en) 2018-08-31 2018-08-31 Uniform resource locator processing method, device, server and readable storage medium

Publications (2)

Publication Number Publication Date
CN109359250A true CN109359250A (en) 2019-02-19
CN109359250B CN109359250B (en) 2022-05-31

Family

ID=65350418

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811014600.8A Active CN109359250B (en) 2018-08-31 2018-08-31 Uniform resource locator processing method, device, server and readable storage medium

Country Status (1)

Country Link
CN (1) CN109359250B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7680773B1 (en) * 2005-03-31 2010-03-16 Google Inc. System for automatically managing duplicate documents when crawling dynamic documents
CN101764720A (en) * 2009-11-24 2010-06-30 福建星网锐捷网络有限公司 Method and system for testing filtration performance
CN104272263A (en) * 2012-02-29 2015-01-07 网络装置公司 Fragmentation control for performing deduplication operations
CN103530336A (en) * 2013-09-30 2014-01-22 北京奇虎科技有限公司 Equipment and method for identifying invalid parameters in URLs
CN104933056A (en) * 2014-03-18 2015-09-23 腾讯科技(深圳)有限公司 Uniform resource locator (URL) de-duplication method and device
CN106844389A (en) * 2015-12-07 2017-06-13 阿里巴巴集团控股有限公司 The treating method and apparatus of network resources address URL
CN107169121A (en) * 2017-05-27 2017-09-15 北京知道未来信息技术有限公司 A kind of extraction website URL method

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112492353B (en) * 2019-09-12 2023-06-23 杭州山草互娱科技有限公司 Processing method, device, equipment and storage medium for data in live broadcasting room
CN112492353A (en) * 2019-09-12 2021-03-12 武汉斗鱼鱼乐网络科技有限公司 Method, device and equipment for processing data in live broadcast room and storage medium
CN110825947A (en) * 2019-10-31 2020-02-21 深圳前海微众银行股份有限公司 URL duplicate removal method, device, equipment and computer readable storage medium
CN110825947B (en) * 2019-10-31 2024-03-08 深圳前海微众银行股份有限公司 URL deduplication method, device, equipment and computer readable storage medium
WO2021082938A1 (en) * 2019-10-31 2021-05-06 深圳前海微众银行股份有限公司 Url deduplication method, apparatus, device and computer-readable storage medium
CN111259282A (en) * 2020-02-13 2020-06-09 深圳市腾讯计算机系统有限公司 URL duplicate removal method and device, electronic equipment and computer readable storage medium
CN111259282B (en) * 2020-02-13 2023-08-29 深圳市腾讯计算机系统有限公司 URL (Uniform resource locator) duplication removing method, device, electronic equipment and computer readable storage medium
CN111935133A (en) * 2020-08-06 2020-11-13 北京顶象技术有限公司 White list generation method and device
CN112015483A (en) * 2020-08-07 2020-12-01 北京浪潮数据技术有限公司 POST request parameter automatic processing method and device and readable storage medium
CN112015483B (en) * 2020-08-07 2021-12-03 北京浪潮数据技术有限公司 POST request parameter automatic processing method and device and readable storage medium
CN112804373A (en) * 2020-12-30 2021-05-14 微医云(杭州)控股有限公司 Interface domain name determining method and device, electronic equipment and storage medium
CN112287201A (en) * 2020-12-31 2021-01-29 北京精准沟通传媒科技股份有限公司 Method, device, medium and electronic equipment for removing duplicate of crawler request
CN114553550A (en) * 2022-02-24 2022-05-27 京东科技信息技术有限公司 Request detection method and device, storage medium and electronic equipment
CN114553550B (en) * 2022-02-24 2024-02-02 京东科技信息技术有限公司 Request detection method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN109359250B (en) 2022-05-31

Similar Documents

Publication Publication Date Title
CN109359250A (en) Uniform resource locator processing method, device, server and readable storage medium
CN109165136B (en) Terminal operation data monitoring method, terminal device and medium
CN112039942A (en) Subscription and publishing method and server
CN107809383B (en) MVC-based path mapping method and device
CN110677384B (en) Phishing website detection method and device, storage medium and electronic device
US10657182B2 (en) Similar email spam detection
CN111399756B (en) Data storage method, data downloading method and device
CN112954089B (en) Method, device, equipment and storage medium for analyzing data
CN104572727A (en) Data querying method and device
CN108875091A (en) A kind of distributed network crawler system of unified management
CN110413845B (en) Resource storage method and device based on Internet of things operating system
CN109614327B (en) Method and apparatus for outputting information
CN111182089A (en) Container cluster system, method and device for accessing big data assembly and server
CN112073374B (en) Information interception method, device and equipment
CN113452780A (en) Access request processing method, device, equipment and medium for client
US10931688B2 (en) Malicious website discovery using web analytics identifiers
CN108154024B (en) Data retrieval method and device and electronic equipment
CN110532774A (en) Hook inspection method, device, server and readable storage medium
US10491606B2 (en) Method and apparatus for providing website authentication data for search engine
CN110674427B (en) Method, device, equipment and storage medium for responding to webpage access request
CN111368227A (en) URL processing method and device
CN113810381A (en) Crawler detection method, web application cloud firewall, device and storage medium
CN114238767B (en) Service recommendation method, device, computer equipment and storage medium
CN109086414B (en) Method, apparatus and storage medium for searching blockchain data
CN110392032B (en) Method, device and storage medium for detecting abnormal URL

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200924

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200924

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant