CN112084439A

CN112084439A - Method, device, equipment and storage medium for identifying variable in URL

Info

Publication number: CN112084439A
Application number: CN202010909457.XA
Authority: CN
Inventors: 尚侠; 张雪松; 罗清篮; 陈宁
Original assignee: Shanghai Mule Network Technology Co ltd
Current assignee: Shanghai Mule Network Technology Co ltd
Priority date: 2020-09-02
Filing date: 2020-09-02
Publication date: 2020-12-15
Anticipated expiration: 2040-09-02
Also published as: CN112084439B

Abstract

The invention relates to a method, a device, equipment and a storage medium for identifying variables in a URL. The method comprises the following steps: acquiring access path data of a website to be identified; preprocessing the access path data to obtain a path relation data set and a level threshold value set; identifying quantitative data and suspected variable data in the access path data according to the path relation data set and the level threshold value set; checking the suspected variable data to obtain variable data; and integrating and outputting the quantitative data and the variable data. By adopting the method, the variable identification can be automatically carried out on the access path data, and the variable identification efficiency is greatly improved.

Description

Method, device, equipment and storage medium for identifying variable in URL

Technical Field

The invention relates to the technical field of website testing and protection, in particular to a method, a device, equipment and a storage medium for identifying variables in a URL (uniform resource locator).

Background

With the popularization of the application of websites, more and more websites are applied to various industries. Before a website is put into use, in order to ensure that the website can normally operate according to an expected plan, a penetration test needs to be performed on the website. Submitting attack codes through parameters is a common means of penetration testing or scanning of websites.

Parameters are typically delivered in three forms. The first, known as the Query String, is delivered via the URL: in http://www.host.com?a=1&b=2, the section "a=1&b=2" indicates that the value of parameter a is 1 and the value of parameter b is 2. This approach is common when acquiring data, for example when acquiring the details of an article. The second is delivery through a form: the content filled in by the user is assembled by the front end as required and placed in the payload part of the request data packet. This approach is often used to send data to the back end, such as sending article-related content to the back end when creating an article. The third is a segment contained in the URL path, such as the "5f0ea827cf1361002210387f" part of http://www.host.com/project/5f0ea827cf1361002210387f/tasks, which the back end uses as a parameter, in this case as an object id. This approach is common in RESTful APIs and pseudo-static routes, and may occur both when data is obtained from the back end and when data is sent to it. Existing approaches usually rely on manual tagging, or treat the entire URL as a single variable; for example, when a website uses a RESTful API or pseudo-static addresses, parameters are sometimes submitted as part of the access path, and it is not known exactly which part is a parameter, as it would be with a form. Meanwhile, when identifying request parameters, the prior art usually abandons the analysis of variable parts in addresses or marks them through manually defined configuration. Therefore, the existing penetration method affects both the learning method and the accuracy of machine learning.

Disclosure of Invention

In view of the foregoing, it is an object of the present invention to overcome the deficiencies of the prior art and to provide a method, an apparatus, a device and a storage medium for identifying a variable in a URL.

In order to achieve the purpose, the invention adopts the following technical scheme:

a method of identifying a variable in a URL, comprising:

acquiring access path data of a website to be identified;

preprocessing the access path data to obtain a path relation data set and a level threshold value set;

identifying quantitative data and suspected variable data in the access path data according to the path relation data set and the level threshold set;

checking the suspected variable data to obtain variable data;

and integrating and outputting the quantitative data and the variable data.

Optionally, the preprocessing the access path data to obtain a path relationship data set and a level threshold set includes:

dividing the access path data according to a preset rule to obtain a plurality of path section nodes;

generating a node data structure according to the path segment node and the path relation; the path relation is obtained by the path section node according to the access path data;

and obtaining the path relation data set and the level threshold value set according to the node data structure.

Optionally, the path relation data set includes: a child node number, a referenced number, and a sibling node number;

the set of hierarchical thresholds comprises: a child node number threshold, a referenced number threshold, and a back-referenced coefficient threshold;

the obtaining the path relation data set and the level threshold value set according to the node data structure includes:

counting the child node number, the parent node number and the sibling node number of each path segment node;

calculating the referenced number of each path segment node and the referenced number mean according to the parent node number;

calculating the child node number weighting coefficient of each level in the node data structure according to the child node number;

calculating the child node number threshold according to the child node number weighting coefficient;

calculating the referenced number weighting coefficient of the nodes of each level according to the referenced number mean;

calculating the referenced number threshold according to the referenced number weighting coefficient;

calculating a back-referenced coefficient according to the parent node number and the sibling node number;

calculating the back-referenced coefficient mean according to the back-referenced coefficient;

calculating the back-referenced coefficient threshold using the back-referenced coefficient mean.

Optionally, the identifying quantitative data and suspected variable data in the access path data according to the path relation data set and the level threshold set includes:

judging whether the child node number of any path segment node is greater than the child node number threshold;

if yes, the path segment node is judged to be quantitative data;

otherwise, judging whether the referenced number is greater than the referenced number threshold;

if yes, the path segment node is judged to be quantitative data;

otherwise, judging whether the referenced number is greater than the sibling node number;

if yes, the path segment node is judged to be quantitative data;

otherwise, judging whether the back-referenced coefficient threshold divided by the child node number is greater than the child node number threshold;

if yes, the path segment node is judged to be quantitative data;

otherwise, the path segment node is judged to be the suspected variable data.

Optionally, the method further includes:

and marking the suspected variable data by using a preset wildcard character, and generating a tree-shaped path structure by combining the quantitative data.

Optionally, the verifying the suspected variable data to obtain variable data includes:

traversing lower nodes of the suspected variable data and lower nodes of the quantitative data in the tree path structure;

judging whether quantitative nodes meeting a first preset condition exist in subordinate nodes of the quantitative data or not; the quantitative node is the same as a lower node of the suspected variable data;

if yes, judging the suspected variable data to be quantitative data;

otherwise, judging whether the suspected variable data has a sibling node meeting a second preset condition; the sibling node is a terminal node, and the sibling node is quantitative data;

if yes, judging the suspected variable data to be quantitative data;

otherwise, judging the suspected variable data to be variable data.

Optionally, the preset rule is: with "/" as the segmentation point.

An apparatus for identifying variables in a URL, comprising:

the access path acquisition module is used for acquiring access path data of the website to be identified;

the preprocessing module is used for preprocessing the access path data to obtain a path relation data set and a level threshold set;

the quantitative identification module is used for identifying quantitative data and suspected variable data in the access path data according to the path relation data set and the level threshold value set;

the suspected variable checking module is used for checking the suspected variable data to obtain variable data;

and the result integration output module is used for integrating and outputting the quantitative data and the variable data.

An apparatus for identifying variables in a URL, comprising:

a processor, and a memory coupled to the processor;

said memory being adapted to store a computer program adapted to at least perform said method of identifying a variable in a URL;

the processor is used for calling and executing the computer program in the memory.

A storage medium storing a computer program which, when executed by a processor, performs the steps of the method of identifying a variable in a URL as described above.

The technical scheme provided by the application can comprise the following beneficial effects:

the application discloses a method for identifying variables in a URL, which comprises the following steps: acquiring access path data of a website to be identified; preprocessing the access path data to obtain a path relation data set and a level threshold value set; identifying quantitative data and suspected variable data in the access path data according to the path relation data set and the level threshold value set; checking the suspected variable data to obtain variable data; and integrating and outputting the quantitative data and the variable data. According to the method, the access path data of the website is preprocessed, then the quantitative data and the suspected variable data in the path are identified, then the suspected variable data are verified, the final variable data are determined, and then the quantitative data and the variable data are integrated to obtain the final identification result. According to the method, the variable data part in the access path can be automatically analyzed and calculated through the access data, and the variable identification efficiency is greatly improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a method for identifying variables in a URL according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method of pre-processing provided by an embodiment of the present invention;

FIG. 3 is a flow chart of a method of quantitative data identification provided by an embodiment of the present invention;

FIG. 4 is a flowchart of a method for checking suspected variables according to an embodiment of the present invention;

FIG. 5 is a block diagram of an apparatus for identifying variables in a URL according to one embodiment of the present invention;

FIG. 6 is a diagram illustrating an apparatus for identifying variables in a URL according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without any inventive step, are within the scope of the present invention.

FIG. 1 is a flowchart of a method for identifying variables in a URL according to an embodiment of the present invention. Referring to fig. 1, a method of identifying a variable in a URL, comprising:

Step 101: Acquire access path data of the website to be identified. The access path data in the present application is log data.

Step 102: Preprocess the access path data to obtain a path relation data set and a level threshold set. In this step, after the access path data to be analyzed is loaded into the system corresponding to this embodiment, the data is preprocessed to obtain data on which the subsequent identification operations can be performed.

Step 103: Identify quantitative data and suspected variable data in the access path data according to the path relation data set and the level threshold set.

Step 104: Check the suspected variable data to obtain variable data.

Step 105: Integrate and output the quantitative data and the variable data.

According to the method, the variable data part and the quantitative data part in the path are identified by analyzing the access path data, so that the automatic identification function of the variable part in the network path is realized, the variable identification efficiency is greatly improved, and the efficiency of the penetration test is further improved.

More specifically, on the basis of the above embodiment, the present application further discloses the implementation process of step 102, preprocessing the access path data to obtain a path relation data set and a level threshold set, which is as follows:

FIG. 2 is a flow chart of a method of preprocessing provided by an embodiment of the present invention. Referring to fig. 2, preprocessing the access path data to obtain a path relation data set and a level threshold set, includes:

Step 201: Segment the access path data according to a preset rule to obtain a plurality of path segment nodes. In the present application, based on the characteristics of URLs, a path is divided at "/" into a plurality of path segments, which are collectively referred to as path segment nodes.
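The segmentation in step 201 can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the application: the function name split_path is hypothetical, and dropping the empty pieces produced by a leading, trailing or doubled "/" is an assumption that the text does not specify.

```python
def split_path(path: str) -> list[str]:
    """Split an access path into path segment nodes at "/"."""
    # Assumption: empty pieces from a leading, trailing or doubled "/" are dropped.
    return [segment for segment in path.split("/") if segment]


print(split_path("/project/5f0ea827cf1361002210387f/tasks"))
```

Applied to the path of the example URL in the Background, the sketch yields the three segments project, 5f0ea827cf1361002210387f and tasks.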

Step 202: Generate a node data structure according to the path segment nodes and the path relation, where the path relation between path segment nodes is obtained from the access path data. Each path segment node is placed in the node data structure according to its original path relation. For example, when the paths of the original data in the log are /a/b/c, /a/b/e and /a/d/c, the node data structure has a at the first level with child nodes b and d; b has child nodes c and e; and d has child node c, so that c is shared by the parent nodes b and d.

Step 203: Obtain the path relation data set and the level threshold set according to the node data structure. The path relation data set comprises: a child node number, a referenced number, and a sibling node number; the level threshold set comprises: a child node number threshold, a referenced number threshold, and a back-referenced coefficient threshold.

The specific process of step 203 is as follows. First, count the child node number, the parent node number and the sibling node number of each path segment node. For example, in the node data structure above, the child nodes of b are c and e, so the child node number of b is 2. Node c has two parent nodes, b and d, so the parent node number of c is 2. The sibling node of e is c, so the sibling node number of e is 1; likewise, the sibling node of c is e, so the sibling node number of c is 1.
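The counting described above can be sketched as follows. The sketch assumes that a node is identified by its level together with its segment text, which is consistent with c having the two parents b and d in the example; the function name build_relations and the use of sets are illustrative choices, not details taken from the application.

```python
from collections import defaultdict


def build_relations(paths):
    """Return child, parent and sibling sets for every (level, segment) node."""
    children = defaultdict(set)
    parents = defaultdict(set)
    for path in paths:
        segments = [s for s in path.split("/") if s]
        # Link each segment to the segment that follows it in the same path.
        for level in range(1, len(segments)):
            parent = (level, segments[level - 1])
            child = (level + 1, segments[level])
            children[parent].add(child)
            parents[child].add(parent)
    siblings = defaultdict(set)
    for kids in children.values():
        for kid in kids:
            # Siblings are the other children that share a parent with this node.
            siblings[kid] |= kids - {kid}
    return children, parents, siblings


children, parents, siblings = build_relations(["/a/b/c", "/a/b/e", "/a/d/c"])
print(len(children[(2, "b")]))  # child node number of b
print(len(parents[(3, "c")]))   # parent node number of c
print(len(siblings[(3, "e")]))  # sibling node number of e
```

For the three example paths, the sketch reproduces the counts given in the text: b has 2 child nodes, c has 2 parent nodes, and e and c each have 1 sibling node.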

Next, calculate the referenced number of each path segment node and the referenced number mean according to the parent node number. Taking the node data structure above as an example, e is referenced once, by b, so its referenced number is 1; c is referenced twice, by b and by d, so its referenced number is 2. At depth level 3, the referenced number mean of the child node level of b is (1 + 2)/2 = 1.5, and the referenced number mean of the child node level of d is 2/1 = 2.
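The arithmetic of this example can be checked with a short sketch. The dictionary below hard-codes the child lists of the example structure, and the variable names are hypothetical; the sketch only illustrates the worked numbers and is not the implementation of the application.

```python
from statistics import mean

# Child lists of the example structure built from /a/b/c, /a/b/e and /a/d/c.
children = {"a": ["b", "d"], "b": ["c", "e"], "d": ["c"]}

# Referenced number: how many parent nodes point at each node.
referenced = {}
for kids in children.values():
    for kid in kids:
        referenced[kid] = referenced.get(kid, 0) + 1

# Referenced number mean over the child node level of each parent.
referenced_mean = {
    parent: mean(referenced[kid] for kid in kids)
    for parent, kids in children.items()
}
print(referenced_mean["b"], referenced_mean["d"])
```

The sketch gives a referenced number of 1 for e and 2 for c, and referenced number means of 1.5 under b and 2 under d, matching the values above.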

Calculate the child node number weighting coefficient of each level in the node data structure according to the child node number, and calculate the child node number threshold according to the child node number weighting coefficient. The child node number threshold is obtained from the child node number weighting coefficient together with the degree of dispersion among the data.

Calculate the referenced number weighting coefficient of the nodes of each level according to the referenced number mean, and calculate the referenced number threshold according to the referenced number weighting coefficient. The referenced number weighting coefficient is calculated from the referenced number of each path segment node of the level and the dispersion among the data.

Calculate a back-referenced coefficient according to the parent node number and the sibling node number; calculate the back-referenced coefficient mean according to the back-referenced coefficients; and calculate the back-referenced coefficient threshold using the back-referenced coefficient mean.

It should be noted that the threshold values mentioned in the above embodiments may be replaced by any value that partitions the complete data set, such as the mean, the median, or (max + min)/2; the specific form is not limited. The weighting coefficients may be replaced by values expressing the dispersion among the data of the complete set, such as the standard deviation or the variance; the specific form is not limited.
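The alternative statistics named above can be computed with the Python standard library, as in the following sketch. The text does not give the formula that combines a weighting coefficient with a threshold, so the sketch only computes the candidate values; the function names, the sample data and the choice of population (rather than sample) statistics are assumptions.

```python
from statistics import mean, median, pstdev, pvariance


def threshold_candidates(values):
    """Values that could serve as a threshold partitioning the complete set."""
    return {
        "mean": mean(values),
        "median": median(values),
        "midrange": (max(values) + min(values)) / 2,
    }


def dispersion_candidates(values):
    """Values that express how spread out the complete set is."""
    # Assumption: population statistics; the text does not say which variant.
    return {"stdev": pstdev(values), "variance": pvariance(values)}


child_counts = [1, 2, 2, 7]  # hypothetical child node numbers at one level
print(threshold_candidates(child_counts))
print(dispersion_candidates(child_counts))
```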

In more detail, on the basis of the above embodiment, the present application further discloses the implementation process of step 103, identifying quantitative data and suspected variable data in the access path data according to the path relation data set and the level threshold set, which is as follows:

FIG. 3 is a flow chart of a method for quantitative data identification according to an embodiment of the present invention. Identifying quantitative data and suspected variable data within the access path data according to the path relationship data set and the level threshold set, including:

Step 301: Judge whether the child node number of any path segment node is greater than the child node number threshold.

Step 302: If the child node number is greater than the child node number threshold, judge the path segment node to be quantitative data.

Step 303: If the child node number is not greater than the child node number threshold, judge whether the referenced number is greater than the referenced number threshold; if yes, go to step 302.

Step 304: If the referenced number is not greater than the referenced number threshold, judge whether the referenced number is greater than the sibling node number; if yes, go to step 302.

Step 305: If the referenced number is not greater than the sibling node number, divide the back-referenced coefficient threshold by the child node number and judge whether the result is greater than the child node number threshold; if yes, go to step 302.

Step 306: If the back-referenced coefficient threshold divided by the child node number is not greater than the child node number threshold, judge the path segment node to be the suspected variable data.
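The cascade of steps 301 to 306 can be sketched as a single function. The names Thresholds and classify are hypothetical, the threshold values are supplied by the caller, and the handling of a node with no child nodes in step 305 (where the division is undefined) is an assumption, since the text does not address that case.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    child_count: float     # child node number threshold
    referenced: float      # referenced number threshold
    back_reference: float  # back-referenced coefficient threshold


def classify(child_count, referenced, sibling_count, t: Thresholds) -> str:
    """Apply the cascade of steps 301 to 306 to one path segment node."""
    if child_count > t.child_count:   # steps 301 and 302
        return "quantitative"
    if referenced > t.referenced:     # step 303
        return "quantitative"
    if referenced > sibling_count:    # step 304
        return "quantitative"
    # Step 305. Assumption: a node without child nodes fails this test, because
    # the text does not say how a child node number of zero is handled.
    if child_count > 0 and t.back_reference / child_count > t.child_count:
        return "quantitative"
    return "suspected variable"       # step 306
```

A node is treated as quantitative as soon as any one of the four tests succeeds, so only nodes that fail every test remain suspected variables and proceed to the check of step 104.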

Further, on the basis of the above embodiment, the method further includes: marking the suspected variable data with a preset wildcard character, and generating a tree path structure in combination with the quantitative data. According to the judgment result, the variable nodes are generalized into "&", merged, and output to a data structure. If, in the example, b and d are judged to be suspected variables and the other nodes are judged to be quantitative, the output tree path structure is as follows:
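One way to sketch the wildcard marking and merging is to rebuild the tree from the paths while substituting the wildcard for every suspected node. The nested-dictionary representation, the function name generalize and the identification of nodes by (level, segment) are assumptions made for illustration only.

```python
def generalize(paths, suspected, wildcard="&"):
    """Build a nested-dict tree in which suspected nodes share one wildcard."""
    tree = {}
    for path in paths:
        node = tree
        segments = [s for s in path.split("/") if s]
        for level, segment in enumerate(segments, start=1):
            label = wildcard if (level, segment) in suspected else segment
            # setdefault merges branches that end up with the same label.
            node = node.setdefault(label, {})
    return tree


tree = generalize(["/a/b/c", "/a/b/e", "/a/d/c"], {(2, "b"), (2, "d")})
print(tree)
```

Under these assumptions, b and d collapse into a single "&" node beneath a, and that node has the child nodes c and e.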

Meanwhile, on the basis of the above embodiment, the present application further discloses an implementation process of step 104, which is as follows:

FIG. 4 is a flowchart of a method for checking a suspected variable according to an embodiment of the present invention. Referring to FIG. 4, checking the suspected variable data to obtain variable data includes:

Step 401: Traverse the lower nodes of the suspected variable data and the lower nodes of the quantitative data in the tree path structure.

Step 402: Judge whether a quantitative node meeting a first preset condition exists among the lower nodes of the quantitative data; the quantitative node is the same as a lower node of the suspected variable data. The following data structure is taken as an example:

The original parent node of the e node is b.

Step 403: If so, judge the suspected variable data to be quantitative data. Here f, which is regarded as a quantitative node, and b have the same child node e, so at this stage the suspected variable b is restored to quantitative data, and the corrected output result is:

Step 404: Otherwise, judge whether the suspected variable data has a sibling node meeting a second preset condition; the sibling node is a terminal node and the sibling node is quantitative data. If yes, go to step 403.

For example, consider the following case:

After quantitative data identification processing, g is identified as a suspected variable, and the data structure is

Since g is an end node, and its sibling h is also an end node and is considered quantitative, g is restored to quantitative data at this stage.

Step 405: Otherwise, judge the suspected variable data to be variable data.

In this embodiment, the variable part of a path can be automatically analyzed and calculated from log data. Compared with manual marking, this reduces the workload and requires neither knowledge of the website nor the assistance of developers; the automatic calculation can be updated as the analyzed target is updated, which improves timeliness. Meanwhile, the method solves the problem that variables in a URL cannot be effectively identified, and provides accurate learning features for machine learning. Once the variables are identified, paths such as /a/1/c and /a/2/c can be merged into /a/?/c, combining the originally dispersed weights and helping to improve the precision of machine learning.
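The merging of paths after identification can be sketched as follows. The sketch treats the weight of a path as the number of times it occurs, which is an assumption; the function name merge_paths and the identification of variable nodes by (level, segment) are likewise illustrative.

```python
from collections import Counter


def merge_paths(paths, variables, wildcard="?"):
    """Replace variable segments with a wildcard and count the merged paths."""
    merged = Counter()
    for path in paths:
        segments = [s for s in path.split("/") if s]
        generalized = [
            wildcard if (level, segment) in variables else segment
            for level, segment in enumerate(segments, start=1)
        ]
        merged["/" + "/".join(generalized)] += 1
    return merged


hits = merge_paths(["/a/1/c", "/a/2/c"], {(2, "1"), (2, "2")})
print(hits)
```

Both example paths fall under the single key /a/?/c, so counts that were previously spread across two paths accumulate on one generalized path.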

The embodiment of the invention also provides a device for identifying the variable in the URL. Please see the examples below.

FIG. 5 is a block diagram of an apparatus for identifying variables in a URL according to an embodiment of the present invention. An apparatus for identifying variables in a URL, comprising:

an access path obtaining module 501, configured to obtain access path data of a website to be identified;

a preprocessing module 502, configured to preprocess the access path data to obtain a path relation data set and a level threshold set;

a quantitative identification module 503, configured to identify quantitative data and suspected variable data in the access path data according to the path relationship data set and the level threshold set;

a suspected variable checking module 504, configured to check the suspected variable data to obtain variable data;

and a result integration output module 505, configured to integrate and output the quantitative data and the variable data.

In more detail, the preprocessing module 502 is specifically configured to: dividing the access path data according to a preset rule to obtain a plurality of path section nodes; generating a node data structure according to the path segment node and the path relation; the path relation is obtained by the path section node according to the access path data; and obtaining the path relation data set and the level threshold value set according to the node data structure.

The quantitative identification module 503 is specifically configured to: judge whether the child node number of any path segment node is greater than the child node number threshold; if yes, judge the path segment node to be quantitative data; otherwise, judge whether the referenced number is greater than the referenced number threshold; if yes, judge the path segment node to be quantitative data; otherwise, judge whether the referenced number is greater than the sibling node number; if yes, judge the path segment node to be quantitative data; otherwise, judge whether the back-referenced coefficient threshold divided by the child node number is greater than the child node number threshold; if yes, judge the path segment node to be quantitative data; otherwise, judge the path segment node to be the suspected variable data.

The suspected variable checking module 504 is specifically configured to: traverse the lower nodes of the suspected variable data and the lower nodes of the quantitative data in the tree path structure; judge whether a quantitative node meeting a first preset condition exists among the lower nodes of the quantitative data, the quantitative node being the same as a lower node of the suspected variable data; if yes, judge the suspected variable data to be quantitative data; otherwise, judge whether the suspected variable data has a sibling node meeting a second preset condition, the sibling node being a terminal node and being quantitative data; if yes, judge the suspected variable data to be quantitative data; otherwise, judge the suspected variable data to be variable data.

Further, on the basis of the above embodiments, the apparatus in the present application further includes:

and the wildcard character marking module is used for marking the suspected variable data by using a preset wildcard character and generating a tree-shaped path structure by combining the quantitative data.

The variable identification device can be used for automatically identifying the variable in the access path, and the variable identification efficiency is greatly improved. Meanwhile, the identified scattered variables are integrated, and help is provided for improving the precision of machine learning.

In order to more clearly introduce a hardware system implementing the embodiment of the present invention, in correspondence to the method for identifying a variable in a URL provided in the embodiment of the present invention, an embodiment of the present invention further provides a device for identifying a variable in a URL. Please see the examples below.

Fig. 6 is a diagram illustrating an apparatus for identifying variables in a URL according to an embodiment of the present invention. Referring to fig. 6, an apparatus for identifying a variable in a URL, includes:

a processor 601, and a memory 602 connected to the processor 601;

the memory 602 is used for storing a computer program for performing at least the above-mentioned method of identifying a variable in a URL;

the processor 601 is used for calling and executing the computer program in the memory 602.

On the basis of the above embodiment, a storage medium is also disclosed, which stores a computer program that, when executed by a processor, implements the steps of the method for identifying variables in URLs as described above.

It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.

It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present invention, the meaning of "a plurality" means at least two unless otherwise specified.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A method of identifying a variable in a URL, comprising:

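acquiring access path data of a website to be identified;

preprocessing the access path data to obtain a path relation data set and a level threshold value set;

identifying quantitative data and suspected variable data in the access path data according to the path relation data set and the level threshold set;

checking the suspected variable data to obtain variable data;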

and integrating and outputting the quantitative data and the variable data.

2. The method of claim 1, wherein preprocessing the access path data to obtain a set of path relationship data and a set of hierarchical thresholds comprises:

3. The method of claim 2, wherein the path relationship dataset comprises: a child node number, a referenced number, and a sibling node number;

4. The method of claim 3, wherein identifying quantitative data and suspected variable data within the access path data from the set of path relationship data and the set of hierarchical thresholds comprises:

if yes, the path section node is judged to be quantitative data;

otherwise, the path section node is judged to be the suspected variable data.

5. The method of claim 1, further comprising:

6. The method of claim 5, wherein the verifying the suspected variable data to obtain variable data comprises:

if yes, judging the suspected variable data to be quantitative data;

otherwise, judging the suspected variable data to be variable data.

7. The method according to claim 2, wherein the preset rule is: with "/" as the segmentation point.

8. An apparatus for identifying a variable in a URL, comprising:

9. An apparatus for identifying variables in a URL, comprising:

a processor, and a memory coupled to the processor;

the memory for storing a computer program for performing at least the method of identifying a variable in a URL of any one of claims 1-7;

10. A storage medium, characterized in that it stores a computer program which, when executed by a processor, carries out the steps of the method of identifying variables in URLs according to any one of claims 1 to 7.