CN113282849A - Similar URL character string recognition method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN113282849A
Authority
CN
China
Prior art keywords
code
url
url character
row
coding matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110494522.1A
Other languages
Chinese (zh)
Inventor
张强
王涛
皇甫道一
张昭
刘浩杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Suning Software Technology Co ltd
Original Assignee
Nanjing Suning Software Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Suning Software Technology Co ltd filed Critical Nanjing Suning Software Technology Co ltd
Priority to CN202110494522.1A priority Critical patent/CN113282849A/en
Publication of CN113282849A publication Critical patent/CN113282849A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/325 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application discloses a method and a device for identifying similar URL character strings, computer equipment and a storage medium, belonging to the technical field of information security. The method comprises: acquiring a plurality of URL character strings meeting preset conditions; performing binary coding on the specified field in each URL character string and generating a coding matrix from the coding results of the specified fields, wherein each row code in the coding matrix corresponds to one URL character string; for the current row code in the coding matrix, finding all target row codes similar to the current row code in the coding matrix; and determining the URL character string corresponding to each target row code as a URL character string similar to the URL character string corresponding to the current row code. The embodiments of the application identify similar URL character strings in batches, avoid the memory overflow caused in traditional methods by reading all URLs into memory, save storage space, and make comparison easy.

Description

Similar URL character string recognition method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of information security technologies, and in particular, to a method and an apparatus for identifying a similar URL string, a computer device, and a storage medium.
Background
With the continuous development of information technology and the rapid popularization of the internet, the network has become an indispensable tool in people's daily work and life. While it brings convenience, network security problems have become increasingly prominent. Vulnerability scanning is an important information security technology; used in combination with other information security technologies (such as Web application firewalls, intrusion detection systems and database auditing systems), it can improve information security protection and significantly reduce network security risk. Vulnerability scanning is a security detection method that finds vulnerabilities in a designated remote or local computer system by means such as scanning. It mainly falls into categories such as network vulnerability scanning, host vulnerability scanning and database vulnerability scanning.
Most Web scanning systems take a domain name as the unit, use a web crawler to simulate a user's real browsing behavior, crawl the website's URLs (Uniform Resource Locators) comprehensively and deeply, and use a rich set of scanning plug-ins to analyze the website's response information in depth, helping the user find potential security risks in the website. Because many of the obtained URLs are the same or similar (e.g., URLs that differ only in individual parameters), scanning these same or similar URLs is pointless and seriously reduces the efficiency of the vulnerability scanning system. It is therefore important to identify the same or similar URLs among all the obtained URLs. However, the existing methods have the following problems:
(1) only identical URLs can be recognized, and similar URLs are difficult to recognize;
(2) traditional methods generally read all URLs into memory and then compare them by traversal, but storing a massive number of URLs requires a large amount of storage, so it is difficult to hold all the original URL character strings in memory at the same time.
Disclosure of Invention
In order to solve the problems mentioned in the background art, the present application provides a method, an apparatus, a device and a storage medium for identifying similar URL character strings, wherein the technical scheme is as follows:
in a first aspect, a method for identifying similar URL strings is provided, the method including:
acquiring a plurality of URL character strings meeting preset conditions;
performing binary coding on the specified field in each URL character string, and generating a coding matrix from the coding results of the specified fields, wherein each row code in the coding matrix corresponds to one URL character string;
for the current row code in the coding matrix, finding all target row codes similar to the current row code in the coding matrix;
and determining the URL character string corresponding to each target row code as a URL character string similar to the URL character string corresponding to the current row code.
Further, before the step of binary encoding the specified field in each URL string, the method further includes:
and deleting the characters meeting preset deleting conditions in the specified fields aiming at the specified fields in each URL character string.
Preferably, the preset deleting condition includes: deleting all numbers in the specified field and deleting content between special characters in the specified field.
Further, the binary encoding of the specified field in each URL string includes:
for the specified field in each URL character string, processing the specified field with an improved SimHash algorithm to obtain a fixed-length binary code.
Further, the finding out all target row codes similar to the current row code in the coding matrix for the current row code in the coding matrix comprises:
for the current row code in the coding matrix, performing a logical operation on the columns corresponding to all elements of the current row code whose value is a preset value, to obtain a logical operation result;
and searching all target row codes similar to the current row code in the coding matrix according to the logical operation result.
Further, the finding out all target row codes similar to the current row code in the coding matrix according to the logical operation result includes:
determining all candidate row codes in the coding matrix according to the elements of the logical operation result whose value is the preset value, and calculating the Hamming distance between the current row code and each candidate row code;
for each candidate row code, determining the candidate row code as a target row code similar to the current row code when the Hamming distance between the current row code and the candidate row code does not exceed a preset threshold.
Further, the method further comprises:
and deleting URL character strings corresponding to all target line codes similar to the current line code in the plurality of URL character strings.
In a second aspect, an apparatus for identifying similar URL strings is provided, the apparatus comprising:
the acquisition module is used for acquiring a plurality of URL character strings meeting preset conditions;
the encoding module is used for performing binary coding on the specified field in each URL character string and generating a coding matrix from the coding results of the specified fields, wherein each row code in the coding matrix corresponds to one URL character string;
the searching module is used for finding, for the current row code in the coding matrix, all target row codes similar to the current row code in the coding matrix;
and the determining module is used for determining the URL character string corresponding to each target row code as a URL character string similar to the URL character string corresponding to the current row code.
Further, the apparatus further comprises:
and the generalization module is used for deleting the characters meeting preset deletion conditions in the specified fields aiming at the specified fields in each URL character string.
Preferably, the preset deleting condition includes:
deleting all numbers in the specified field and deleting content between special characters in the specified field.
Further, the encoding module is specifically configured to:
for the specified field in each URL character string, processing the specified field with an improved SimHash algorithm to obtain a fixed-length binary code.
Further, the lookup module includes:
the determining submodule is used for performing, for the current row code in the coding matrix, a logical operation on the columns corresponding to all elements of the current row code whose value is a preset value, to obtain a logical operation result;
and the searching submodule is used for searching all target row codes similar to the current row code in the coding matrix according to the logical operation result.
Further, the search submodule is specifically configured to:
determining all candidate row codes in the coding matrix according to the elements of the logical operation result whose value is the preset value, and calculating the Hamming distance between the current row code and each candidate row code;
for each candidate row code, determining the candidate row code as a target row code similar to the current row code when the Hamming distance between the current row code and the candidate row code does not exceed a preset threshold.
Further, the apparatus further comprises:
and the deleting module is used for deleting all URL character strings corresponding to the target line codes similar to the current line code in the plurality of URL character strings.
In a third aspect, a computer device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the following steps:
acquiring a plurality of URL character strings meeting preset conditions;
performing binary coding on the specified field in each URL character string, and generating a coding matrix from the coding results of the specified fields, wherein each row code in the coding matrix corresponds to one URL character string;
for the current row code in the coding matrix, finding all target row codes similar to the current row code in the coding matrix;
and determining the URL character string corresponding to each target row code as a URL character string similar to the URL character string corresponding to the current row code.
In a fourth aspect, there is provided a computer readable storage medium having a computer program stored thereon, the computer program when executed by a processor implementing the steps of:
acquiring a plurality of URL character strings meeting preset conditions;
performing binary coding on the specified field in each URL character string, and generating a coding matrix from the coding results of the specified fields, wherein each row code in the coding matrix corresponds to one URL character string;
for the current row code in the coding matrix, finding all target row codes similar to the current row code in the coding matrix;
and determining the URL character string corresponding to each target row code as a URL character string similar to the URL character string corresponding to the current row code.
The embodiments of the present application provide a method and a device for identifying similar URL character strings, computer equipment and a storage medium. A plurality of URL character strings meeting preset conditions are acquired; the specified field in each URL character string is binary-coded, and a coding matrix is generated from the coding results, each row code in the coding matrix corresponding to one URL character string; for the current row code in the coding matrix, all target row codes similar to the current row code are found in the coding matrix; and the URL character strings corresponding to the target row codes are determined to be URL character strings similar to the URL character string corresponding to the current row code, thereby identifying similar URL character strings. In addition, in the application scenario of URL deduplication and near-duplicate removal, complete URL character strings do not need to be stored: binary-coding the specified field of each of the URL character strings meeting the preset conditions converts the URL character strings into a form that is easy to store and compare, which avoids the memory overflow caused in traditional methods by reading all URLs into memory, saves storage space, and makes comparison easy.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart illustrating a method for identifying similar URL character strings according to an embodiment of the present disclosure;
FIG. 2 illustrates a flow diagram for one embodiment of step 103 of the method illustrated in FIG. 1;
FIG. 3 is a diagram illustrating a similar URL string identification method provided by an embodiment of the present application;
FIG. 4 is a diagram illustrating an encoding matrix provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart illustrating a process for supporting batch finding of similar URLs according to an embodiment of the present disclosure;
FIG. 6 is a block diagram illustrating a similar URL string recognition apparatus according to an embodiment of the present application;
FIG. 7 shows an internal structure diagram of a computer device provided in the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It is to be understood that, unless the context clearly requires otherwise, throughout the description and the claims, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to". In addition, in the description of the present application, "a plurality" means two or more unless otherwise specified.
As mentioned in the background, the prior art can identify only identical URLs and has difficulty identifying similar ones; moreover, all URLs are usually read into memory and then compared by traversal, and because storing a massive number of URLs takes a large amount of storage, it is difficult to hold all the original URL character strings in memory at the same time. In view of this, the embodiments of the present application provide a method for identifying similar URL strings, applicable to URL deduplication and near-duplicate removal. The method identifies similar URLs without storing complete URL strings during identification; by converting all URLs into a form that is easy to store and compare, it avoids the memory overflow caused in conventional methods by reading all URLs into memory, saves storage space, and makes comparison easy. It should be understood that, beyond URL deduplication and near-duplicate removal, the method can be generalized to other application scenarios and technical fields that require deduplication or near-duplicate removal, for example deduplication of online and offline shopping transaction data, massive text deduplication, and duplicate-checking systems for papers.
In addition, in the embodiments of the present application, two URL strings are determined to be similar if their domain names (Host) and HTTP request methods (Method) are the same and their specified fields are similar, where the specified field of a URL string is the Scan_key, i.e. the part located after the domain name and before the parameters. To judge whether the specified fields of two URL strings are similar, the specified fields are binary-coded; if the Hamming distance between the two resulting binary codes does not exceed a preset threshold, the specified fields are determined to be similar.
The similarity of the URL strings is described below in conjunction with the URL dataset structure features as exemplarily shown in Table 1 below.
Table 1: URL dataset structure features
id  Host        Method  Scan_key
1   cn.xxx.com  GET     /AAA/keyword/Wing%AE2%B80%86li_111_ios_4.2.2_.htm
2   xxx.com     POST    /api/BBB/W111111/DDD278652.html
3   cn.xxx.com  GET     /AAA/keyword/Wing%AE2%B80%86lj_111_ios_4.2.2_.htm
4   xxx.com     GET     /api/BBB/W111111/DDD456521.html
Each row in Table 1 represents one URL string. The id uniquely identifies each URL string in the URL dataset, Host records the domain name of the URL string, Method is the HTTP request method, and Scan_key is the part of the URL string after the domain name and before the parameters. Because the URL strings with id 1 and id 3 have the same Host and Method and similar Scan_key values (assuming here a preset threshold of 3), the two are determined to be similar. The URL strings with id 2 and id 4 have the same Host and similar Scan_key values, but their Methods differ, so these two URL strings are not similar.
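The similarity rule illustrated by Table 1 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the implementation of the present application: the `UrlRecord` type, the `is_similar` function and the hand-picked 8-bit codes are hypothetical, and the codes are assumed to have been produced already by some fixed-length binary coding of Scan_key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UrlRecord:
    """Hypothetical record: one dataset row plus the binary code of its Scan_key."""
    id: int
    host: str
    method: str
    code: int  # fixed-length binary code of Scan_key, stored as an integer


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two equal-length codes differ."""
    return bin(a ^ b).count("1")


def is_similar(x: UrlRecord, y: UrlRecord, threshold: int = 3) -> bool:
    """Same Host, same Method, and Scan_key codes within the Hamming threshold."""
    return (
        x.host == y.host
        and x.method == y.method
        and hamming_distance(x.code, y.code) <= threshold
    )


# Toy 8-bit codes chosen by hand for illustration only.
r1 = UrlRecord(1, "cn.xxx.com", "GET", 0b10110100)
r3 = UrlRecord(3, "cn.xxx.com", "GET", 0b10110110)  # differs from r1 in 1 bit
r2 = UrlRecord(2, "xxx.com", "POST", 0b01100001)
r4 = UrlRecord(4, "xxx.com", "GET", 0b01100011)     # close to r2, but Method differs

print(is_similar(r1, r3))  # True
print(is_similar(r2, r4))  # False
```

Checking Host and Method before the Hamming distance means the bit comparison is only reached for records that could be similar at all.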
The scheme provided by the embodiment of the application is described in detail below.
In one embodiment, a similar URL string recognition method is provided, which may be performed by a similar URL string recognition apparatus or a server, which may be implemented in hardware and/or software, and which may be implemented in a stand-alone server or a server cluster. Referring to fig. 1, the method may include the steps of:
Step 101: acquiring a plurality of URL character strings meeting preset conditions.
The plurality of URL strings that satisfy the preset condition may specifically be a plurality of URL strings in which the domain name and the HTTP request method are the same.
Specifically, the URL character strings are clustered according to the domain name in the URL character strings and the HTTP request method, and a plurality of URL character strings with the same domain name and the same HTTP request method are obtained.
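As an illustration of step 101, the grouping by domain name and HTTP request method might look like the sketch below. The function name `group_by_host_and_method` and the tuple layout are hypothetical; the data are the rows of Table 1.

```python
from collections import defaultdict

# Rows of Table 1 as (id, Host, Method, Scan_key).
urls = [
    (1, "cn.xxx.com", "GET", "/AAA/keyword/Wing%AE2%B80%86li_111_ios_4.2.2_.htm"),
    (2, "xxx.com", "POST", "/api/BBB/W111111/DDD278652.html"),
    (3, "cn.xxx.com", "GET", "/AAA/keyword/Wing%AE2%B80%86lj_111_ios_4.2.2_.htm"),
    (4, "xxx.com", "GET", "/api/BBB/W111111/DDD456521.html"),
]


def group_by_host_and_method(rows):
    """Group URL rows so that each group shares the same (Host, Method) pair."""
    groups = defaultdict(list)
    for url_id, host, method, scan_key in rows:
        groups[(host, method)].append((url_id, scan_key))
    return dict(groups)


groups = group_by_host_and_method(urls)
for key, members in groups.items():
    # Print each (Host, Method) key with the ids that fall into that group.
    print(key, [url_id for url_id, _ in members])
```

Each group can then be coded and searched independently, so only the Scan_key values within one group ever need to be compared.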
Step 102: performing binary coding on the specified field in each URL character string, and generating a coding matrix from the coding results of the specified fields, wherein each row code in the coding matrix corresponds to one URL character string.
The specified field of a URL string is the field located after the domain name and before the parameters, that is, the Scan_key of the URL string.
Specifically, for the specified field in each URL string, a fixed-length binary code may be generated according to a preset coding algorithm; the coding results of the specified fields are then assembled into a coding matrix, in which each row code corresponds to one URL string, and each URL string may be uniquely identified by an ID number.
The length of the binary code may be determined by the scale of the data and usually does not exceed 64 bits. The preset coding algorithm may be the SimHash algorithm or another locality-sensitive hashing algorithm capable of mapping the specified field of a URL string to a binary code, which is not specifically limited in this embodiment.
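One way to picture the coding matrix of step 102 is as a list of bit rows with a parallel list of ID numbers. The sketch below assumes the fixed-length codes already exist as integers; `build_coding_matrix` and the sample codes are hypothetical, and the most significant bit is placed in column 0 purely by convention.

```python
def build_coding_matrix(codes, bits):
    """Turn {url_id: integer code} into (ids, matrix).

    matrix[i] is the row code (a list of 0/1 values, most significant bit
    first) of the URL string whose ID number is ids[i].
    """
    ids = sorted(codes)
    matrix = []
    for url_id in ids:
        code = codes[url_id]
        if not 0 <= code < (1 << bits):
            raise ValueError(f"code for id {url_id} does not fit in {bits} bits")
        matrix.append([int(ch) for ch in format(code, f"0{bits}b")])
    return ids, matrix


# Hand-picked 4-bit codes, for illustration only.
ids, matrix = build_coding_matrix({1: 0b1011, 3: 0b1001, 7: 0b0110}, bits=4)
for url_id, row in zip(ids, matrix):
    print(url_id, row)
```

Only the ID numbers and the row codes are kept; the original URL strings do not need to stay in memory once they have been coded.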
Step 103: for the current row code in the coding matrix, finding all target row codes similar to the current row code in the coding matrix.
The row codes in the coding matrix are accessed row by row, and the row code currently being accessed is taken as the current row code.
Specifically, the Hamming distance between the current row code and each other row code in the coding matrix is calculated; if the Hamming distance between the current row code and another row code does not exceed a preset threshold, that row code is determined to be a target row code similar to the current row code. The preset threshold may be set according to actual needs, for example to 3; that is, two codes of the same length are similar when they differ in no more than 3 positions.
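The straightforward form of step 103 described above, in which the current row code is compared with every other row code, can be sketched as follows. The function names and the 8-bit sample matrix are hypothetical.

```python
def hamming_distance(row_a, row_b):
    """Count the positions at which two equal-length row codes differ."""
    return sum(a != b for a, b in zip(row_a, row_b))


def find_similar_rows(matrix, current, threshold=3):
    """Return the indices of all rows within `threshold` bits of row `current`."""
    current_row = matrix[current]
    return [
        i
        for i, row in enumerate(matrix)
        if i != current and hamming_distance(current_row, row) <= threshold
    ]


# Hand-made 8-bit row codes, for illustration only.
matrix = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 0],  # 1 bit away from row 0
    [0, 1, 0, 0, 1, 1, 0, 1],  # 8 bits away from row 0
    [1, 0, 1, 0, 0, 0, 0, 0],  # 2 bits away from row 0
]

print(find_similar_rows(matrix, current=0))  # [1, 3]
```

This version performs one distance computation per other row for every current row.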
Step 104: determining the URL character string corresponding to each target row code as a URL character string similar to the URL character string corresponding to the current row code.
In this embodiment, since each row code in the coding matrix corresponds to one URL character string, once all target row codes similar to the current row code have been found, the URL character strings corresponding to those target row codes can be determined to be similar to the URL character string corresponding to the current row code.
The embodiment of the present application provides a method for identifying similar URL character strings, in which a plurality of URL character strings meeting preset conditions are acquired; the specified field in each URL character string is binary-coded, and a coding matrix is generated from the coding results, each row code in the coding matrix corresponding to one URL character string; for the current row code in the coding matrix, all target row codes similar to the current row code are found in the coding matrix; and the URL character strings corresponding to the target row codes are determined to be URL character strings similar to the URL character string corresponding to the current row code, thereby identifying similar URL character strings. In addition, in the application scenario of URL deduplication and near-duplicate removal, complete URL character strings do not need to be stored: binary-coding the specified field of each of the URL character strings meeting the preset conditions converts the URL character strings into a form that is easy to store and compare, which avoids the memory overflow caused in traditional methods by reading all URLs into memory, saves storage space, and makes comparison easy.
In an embodiment, before the binary encoding step of the specified field in each URL string, the method may further include:
and deleting the characters meeting preset deleting conditions in the specified fields aiming at the specified fields in each URL character string.
Preferably, the preset deleting condition includes: deleting all numbers in the specified field and deleting content between special characters in the specified field.
In a specific implementation, the special characters include "%", "-" and "_". For example, suppose the specified fields of two URL strings are:
"/AAA/keyword/Wing%AE2%B80%86li_111_ios_4.2.2_.htm" and
"/AAA/keyword/Wing%AE2%B80%86lj_111_ios_4.2.2_.htm".
Generalizing the two specified fields yields "/AAA/keyword/wingli.htm" and "/AAA/keyword/wingli.htm", respectively.
It should be understood that this generalization step, which deletes the characters in the specified field of each URL string that meet the preset deletion conditions, is optional.
In this embodiment, generalizing the specified field of each URL string turns two mutually similar strings into two identical or even more similar strings. In addition, because binary coding is performed on the generalized specified fields, the processing resources needed for binary coding are reduced.
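The generalization described above could be approximated with regular expressions as in the sketch below. This is only one possible reading of the deletion conditions, stated as an assumption: within a path segment, everything from the first to the last occurrence of a special character is removed, and then all remaining digits are removed. The name `generalize` and the constant `SPECIAL_CHARS` are hypothetical.

```python
import re

# Special characters as listed in the text above.
SPECIAL_CHARS = ("%", "-", "_")


def generalize(scan_key: str) -> str:
    """Delete content enclosed by special characters, then delete all digits."""
    for ch in SPECIAL_CHARS:
        c = re.escape(ch)
        # Greedy match from the first to the last occurrence of `ch`
        # inside one path segment (the match never crosses a "/").
        scan_key = re.sub(c + r"[^/]*" + c, "", scan_key)
    # Remove every remaining digit.
    return re.sub(r"\d", "", scan_key)


print(generalize("/AAA/keyword/Wing%AE2%B80%86li_111_ios_4.2.2_.htm"))
print(generalize("/api/BBB/W111111/DDD278652.html"))
```

Note that this sketch keeps the original letter case and keeps every letter that lies outside a deleted span, so its output can differ in those respects from the results quoted in the example above.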
In an embodiment, binary encoding the specified field in each URL string in step 102 may include:
for the specified field in each URL character string, processing the specified field with the improved SimHash algorithm to obtain a fixed-length binary code.
The SimHash algorithm is a locality-sensitive hashing algorithm that can map high-dimensional data to a fixed-length binary code. Unlike a traditional hash algorithm, the binary code produced by SimHash preserves the similarity of the original character strings to a certain extent; that is, the SimHash codes of similar character strings share a large number of identical bits. Specifically, the conventional SimHash algorithm proceeds as follows:
(1) Perform word segmentation on the given text to obtain n words $\{k_1, k_2, \ldots, k_n\}$, with corresponding weights $\{w_1, w_2, \ldots, w_n\}$.
(2) Obtain the hash value of each word with a conventional hash algorithm, denoted $b_i = \mathrm{hash}(k_i)$.
(3) Weighting and combining. Accumulate the hash values $b_i$ bit by bit: if bit $j$ of $b_i$ (written $b_{i,j}$) is 0, add $-w_i$; if it is 1, add $w_i$. This gives the weighted value $v_j$ for each bit, where
$$v_j = \sum_{i=1}^{n} \left(2\,b_{i,j} - 1\right) w_i$$
(4) For each bit $j$, the sign of the accumulated value $v_j$ determines the value of $s_j$. If $v_j < 0$, this bit of the final hash value is 0; otherwise it is 1, i.e.:
$$s_j = \begin{cases} 0, & v_j < 0 \\ 1, & v_j \ge 0 \end{cases}$$
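Steps (1) to (4) of the conventional SimHash algorithm can be sketched as below. The sketch assumes that word segmentation has already produced the token list, and it uses the first 8 bytes of an MD5 digest as the per-word hash; that choice, the function name `simhash` and the default of 64 bits are illustrative assumptions rather than anything prescribed by the text.

```python
import hashlib


def simhash(tokens, weights=None, bits=64):
    """Conventional SimHash over already-segmented tokens (bits must be 1..64)."""
    if not 1 <= bits <= 64:
        raise ValueError("bits must be between 1 and 64")
    if weights is None:
        weights = [1] * len(tokens)

    v = [0] * bits  # accumulated weighted value v_j for each bit position j
    for token, w in zip(tokens, weights):
        # Step (2): b_i = hash(k_i), here the first 8 bytes of MD5 (assumption).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        b = int.from_bytes(digest[:8], "big") & ((1 << bits) - 1)
        # Step (3): add +w_i where bit j of b_i is 1, and -w_i where it is 0.
        for j in range(bits):
            v[j] += w if (b >> j) & 1 else -w

    # Step (4): s_j = 0 if v_j < 0, otherwise 1.
    fingerprint = 0
    for j in range(bits):
        if v[j] >= 0:
            fingerprint |= 1 << j
    return fingerprint


a = simhash(["api", "BBB", "W", "DDD", "html"])
b = simhash(["api", "BBB", "W", "EEE", "html"])
print(format(a, "064b"))
print(bin(a ^ b).count("1"))  # Hamming distance between the two fingerprints
```

With a single token of weight 1, every $v_j$ is either $+1$ or $-1$, so the fingerprint equals the hash of that token.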
In this embodiment, since the application scenarios of URL deduplication and near-duplicate removal involve only a single character string, the word segmentation step can be omitted: each URL character string is regarded as one word, and its weight is set to 1. In a specific implementation, the code length can be adjusted according to actual needs. 64-bit codes are usually adopted, but in URL deduplication and near-duplicate removal scenarios with a small amount of data, a code shorter than 64 bits is generally sufficient.
In the embodiment, each URL string is mapped and converted into a binary code (namely, a fingerprint code) with a fixed length by adopting an improved SimHash algorithm, so that the problem of memory overflow caused by the fact that all URLs need to be read into a memory in the traditional method can be solved, and the purposes of saving storage space and facilitating comparison are achieved.
Further, the prior art typically determines whether two URLs are identical by pairwise comparison. Such a method can identify only one URL identical to the current URL at a time and cannot find all URLs identical or similar to the current URL in a single pass. Because it involves both horizontal comparison (i.e., comparison between different URLs) and vertical comparison (i.e., comparison between the characters of each URL), its space-time cost is high. For example, when identifying similar URLs among 10 URLs (URL1 to URL10), and assuming that URL1 is the same as URL6 and URL9, 8 row-to-row comparisons are required to find all URLs that are the same as URL1, and each comparison further requires comparing the characters at the corresponding positions of the two URLs, which is clearly inefficient.
In order to find similar URL strings in batches, in an embodiment, as shown in fig. 2, step 103 of finding, for a current row code in the coding matrix, the target row codes similar to the current row code in the coding matrix may include the following steps:
Step 201: for the current row code in the coding matrix, performing a logical operation on the columns corresponding to all elements having a preset value in the current row code to obtain a logical operation result.
The preset value may be set to 1.
Specifically, for the current row code in the coding matrix, the column containing each element whose value equals the preset value is determined, and a logical operation, which may specifically be a logical AND, is performed on all of the determined columns.
Step 202: finding all target row codes similar to the current row code in the coding matrix according to the logical operation result.
Specifically, the implementation process of step 202 may include the following steps:
Step a: determining all candidate row codes in the coding matrix according to the elements whose value is the preset value in the logical operation result, and calculating the Hamming distance between the current row code and each candidate row code.
The Hamming distance is the number of positions at which two codes of the same length differ.
Step b: for each candidate row code, when the Hamming distance between the current row code and the candidate row code does not exceed a preset threshold, determining the candidate row code as a target row code similar to the current row code.
The preset threshold may be set according to actual needs, for example, set to 3.
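Steps a and b can be illustrated with a short Python sketch. Codes are represented here as equal-length strings of '0' and '1'; the function names `hamming_distance` and `filter_targets` are hypothetical, and the default threshold of 3 is taken from the example value above.

```python
def hamming_distance(code_a: str, code_b: str) -> int:
    """Count the positions at which two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))


def filter_targets(current: str, candidates: list[str], threshold: int = 3) -> list[str]:
    """Keep the candidates whose Hamming distance to `current` is within the threshold."""
    return [c for c in candidates if hamming_distance(current, c) <= threshold]
```

For instance, `hamming_distance("01011", "01111")` is 1, so with a threshold of 3 the second code would be accepted as a target row code.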
Illustratively, assume that the codes of the specified fields in two URL strings are the r_i-th row code and the r_j-th row code of the coding matrix, respectively. If the Hamming distance between the r_i-th row code and the r_j-th row code is not greater than the preset threshold δ (which may be set to 3), the two specified fields are considered similar, and the two URL strings are accordingly considered similar to each other.
In a specific implementation, a first variable (e.g., need_del_ids) may be used to store the id numbers of the URLs that need to be deleted, a second variable (e.g., rows_num) may be used to record the total number of rows of the URL coding matrix, and a third variable (e.g., need_del_indexes) may be used to mark the id numbers that have already been accessed, so as to avoid accessing the same data repeatedly. For each row in the URL coding matrix, it is first determined whether the row has been accessed; if not, the subscripts of the positions holding 1 in the current row are stored in a fourth variable (e.g., true_idx). If the number of 1s in the current row is equal to 0, the current row is skipped. A fifth variable (e.g., temp_col) serves as an intermediate variable that records the result of each logical AND operation between columns; it is initialized to the column corresponding to the first 1 position in the current row. The columns corresponding to the remaining 1 positions in the current row are then ANDed with the fifth variable in turn, and each result is reassigned to the fifth variable. Next, the subscripts of the positions holding 1 in the final fifth variable are found, and the relationship between each corresponding row of the coding matrix and the current row is determined (the two rows are either identical, or one contains the other). The rows whose Hamming distance from the current row is not greater than the threshold δ, i.e., the rows similar to the current row, are found, and their row indexes are recorded in the first variable and the third variable. Finally, the algorithm returns the first variable, giving the id numbers of the URLs to be deleted.
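The procedure described above might be sketched in Python as follows. The variable names follow those in the text; the function name `find_similar_rows`, the representation of each column as an integer bitmask over the rows, and the choice to keep the current row while marking only its similar rows for deletion are assumptions of this sketch rather than details confirmed by the source.

```python
def find_similar_rows(matrix: list[list[int]], ids: list[str], delta: int = 3) -> list[str]:
    """Return the ids of rows that are similar to an earlier, retained row."""
    rows_num = len(matrix)
    n_bits = len(matrix[0]) if rows_num else 0
    # columns[j] is a bitmask over rows: bit i is set iff matrix[i][j] == 1.
    columns = [sum(matrix[i][j] << i for i in range(rows_num)) for j in range(n_bits)]

    need_del_ids = []          # ids of URLs to delete
    need_del_indexes = set()   # row indexes already marked, to avoid revisiting

    for cur in range(rows_num):
        if cur in need_del_indexes:
            continue
        true_idx = [j for j, bit in enumerate(matrix[cur]) if bit == 1]
        if not true_idx:
            continue  # a row without any 1 is skipped
        # AND together the columns at every 1 position of the current row.
        temp_col = columns[true_idx[0]]
        for j in true_idx[1:]:
            temp_col &= columns[j]
        # Rows still set in temp_col have a 1 wherever the current row has a 1.
        for cand in range(rows_num):
            if cand == cur or cand in need_del_indexes:
                continue
            if (temp_col >> cand) & 1:
                dist = sum(a != b for a, b in zip(matrix[cur], matrix[cand]))
                if dist <= delta:
                    need_del_ids.append(ids[cand])
                    need_del_indexes.add(cand)
    return need_del_ids
```

Note that the column AND yields only rows whose 1 positions include all 1 positions of the current row, which corresponds to the identical or containing relationship mentioned above; the Hamming distance check then decides whether such a candidate counts as similar.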
It can be seen that the above algorithm for finding similar URLs in batches has the following advantages: (1) the massive number of row-to-row comparisons is converted into column-to-column logical operations; in practice, the length of each code usually does not exceed 64 bits while the total number of codes is usually massive, so column-to-column logical operations significantly reduce the number of comparisons and are highly efficient; (2) a row-to-row comparison can find at most one row identical to the current row at a time, whereas the above algorithm can find all rows similar to the current row at one time at a lower cost (no more than 64 AND operations).
In one embodiment, the method may further comprise:
deleting, from the plurality of URL character strings, the URL character strings corresponding to all target row codes similar to the current row code.
Specifically, the URL character strings to be deleted are determined according to the id numbers corresponding to the target row codes of the current row code, and are then deleted.
In order to further explain the method for identifying similar URL strings provided in the embodiments of the present application, the following example is provided with reference to fig. 3 to 5.
Fig. 3 shows a schematic flow chart of identifying similar URL strings provided in an embodiment of the present application. As shown in fig. 3, the URL strings are first clustered according to Host and Method, so that URLs with the same Host and Method are placed in the same set. Then, for each URL character string in the same set, the improved SimHash algorithm is used to generate a fixed-length binary code (the code length is determined by the data size and usually does not exceed 64 bits); that is, similarity comparison between character strings is converted into a similarity measurement between fingerprint codes. Finally, steps 201 and 202 are executed to find similar binary codes in batches, and the similar binary codes are removed, thereby indirectly achieving deduplication and de-similarity of the original URLs.
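The first stage of this flow, grouping URLs by Host and Method, could look like the following Python sketch. The input format (a list of `(method, url)` pairs), the function name `group_by_host_and_method`, and the case normalization of host and method are assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_host_and_method(requests: list[tuple[str, str]]) -> dict[tuple[str, str], list[str]]:
    """Place URLs that share the same Host and Method into the same set."""
    groups = defaultdict(list)
    for method, url in requests:
        host = urlsplit(url).netloc.lower()  # the Host part of the URL
        groups[(host, method.upper())].append(url)
    return dict(groups)
```

Each resulting group would then be encoded into its own coding matrix, so that row codes are only compared within one Host and Method set.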
Fig. 4 shows a schematic diagram of a coding matrix provided in an embodiment of the present application. The coding matrix is the SimHash coding matrix corresponding to the generalized Scan_key, and each row in the coding matrix represents one URL. As shown in fig. 4, all URLs in the set have the same Host and Method. To find all rows in the coding matrix similar to a certain row (such as the first row in fig. 4), the conventional method needs to traverse the whole set, which is inefficient; because it involves both horizontal comparison (i.e., comparison between different Scan_keys) and vertical comparison (i.e., comparison between the characters of each Scan_key), its space-time cost is high. To find similar rows in batches, the algorithm for finding similar URLs in batches can be used: all Scan_keys similar to the current generalized Scan_key can be found at one time simply by performing logical operations on the rows and columns of the binary codes corresponding to the Scan_keys, which is equivalent to finding all similar URLs.
Fig. 5 shows a flowchart of batch finding of similar URLs according to an embodiment of the present application. Suppose that all Scan_keys similar to the first row r_1 are to be found. First, the number of 1s in r_1, denoted s_1, is calculated (here s_1 = 3), and then all columns corresponding to the 1 positions are found, i.e., {c_2, c_4, c_5}. All Scan_keys similar to r_1 are obtained by performing a logical AND on c_2, c_4 and c_5. Using tmp as an intermediate variable, initially tmp = c_2, i.e., the column corresponding to the first 1 position in r_1. tmp is ANDed with c_4 and the result is assigned to tmp; the new tmp is then ANDed with c_5 and the result is again stored in tmp. That is, initially tmp = {10111}, then tmp = tmp & c_4 = {10111} & {11110} = {10110}, and finally tmp = tmp & c_5 = {10110} & {10111} = {10110}. The final result {10110} indicates that r_1 may be similar to r_3 and r_4. To further clarify the relationship between r_1 and r_3 and r_4, the numbers of 1s in r_3 and r_4, denoted s_3 and s_4, need to be calculated (here s_3 = 4 and s_4 = 3). Since s_3 > s_1 and s_4 = s_1, r_4 and r_1 are identical, while r_3 and r_1 may be similar. In this example, although s_3 > s_1, the Hamming distance between r_1 and r_3 is 1 (assuming that the threshold δ is 3 here), so they are similar.
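The column AND in this example can be reproduced with a few lines of Python. Each column is written as a bit string read from top to bottom (the leftmost character belongs to r_1), as in the text; the helper names `and_columns` and `candidate_rows` are hypothetical.

```python
def and_columns(columns: list[str]) -> str:
    """AND equal-length bit strings together and return the result as a bit string."""
    width = len(columns[0])
    tmp = int(columns[0], 2)  # tmp starts as the first column
    for col in columns[1:]:
        tmp &= int(col, 2)
    return format(tmp, f"0{width}b")


def candidate_rows(mask: str, current_row: int = 1) -> list[int]:
    """Return the 1-based indexes of rows set in `mask`, excluding the current row."""
    return [i for i, bit in enumerate(mask, start=1) if bit == "1" and i != current_row]


c2, c4, c5 = "10111", "11110", "10111"
tmp = and_columns([c2, c4, c5])
print(tmp)                  # 10110
print(candidate_rows(tmp))  # [3, 4]
```

The remaining candidates r_3 and r_4 are then checked against the Hamming distance threshold as described above.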
In summary, compared with the prior art, the embodiment of the application can realize effective data compression by adopting the improved SimHash algorithm, thereby saving the storage space, and in addition, by adopting the algorithm for finding out similar URLs in batch, the method can realize batch removal of the same and similar URLs in URL data sets, and can obviously reduce the comparison times. In a vulnerability scanning application scene, the efficient URL de-similarity algorithm can avoid unnecessary scanning operation on the same or similar URLs, and the vulnerability scanning efficiency is obviously improved.
In one embodiment, a similar URL string recognition apparatus is provided, as shown in fig. 6, the apparatus may include:
an obtaining module 602, configured to obtain multiple URL character strings that meet a preset condition;
the encoding module 604 is configured to perform binary encoding on the specified field in each URL character string, and generate an encoding matrix according to an encoding result of the specified field in each URL character string, where each row of codes in the encoding matrix corresponds to one URL character string;
a searching module 606, configured to search all target row codes similar to the current row code in the coding matrix for the current row code in the coding matrix;
the determining module 608 is configured to determine the URL character string corresponding to the target row code as a URL character string similar to the URL character string corresponding to the current row code.
In one embodiment, the apparatus further comprises:
a generalization module 603, configured to delete, for a specified field in each URL string, a character in the specified field that meets a preset deletion condition;
preferably, the preset deleting condition includes:
deleting all numbers in the specified field and deleting content between special characters in the specified field.
In one embodiment, the encoding module 604 is specifically configured to:
for the specified field in each URL character string, encode the specified field using the improved SimHash algorithm to obtain a fixed-length binary code.
In one embodiment, the lookup module 606 includes:
the determining submodule is used for carrying out logical operation on columns corresponding to all elements with preset values in the current row code aiming at the current row code in the coding matrix to obtain a logical operation result;
and the searching submodule is used for searching all target row codes similar to the current row code in the coding matrix according to the logical operation result.
In one embodiment, the lookup submodule is specifically configured to:
determining all candidate row codes in the coding matrix according to the elements with the element values as preset values in the logical operation result, and calculating the Hamming distance between the current row code and each candidate row code;
and for each candidate row code, determining the candidate row code as a target row code similar to the current row code when the Hamming distance between the current row code and the candidate row code does not exceed a preset threshold.
In one embodiment, the apparatus further comprises:
a deleting module 610, configured to delete, from the plurality of URL strings, the URL strings corresponding to all target row codes similar to the current row code.
For specific limitations of the similar URL string identification device, reference may be made to the above limitations of the similar URL string identification method, and details are not repeated here. The modules in the above-mentioned similar URL string identification device can be implemented wholly or partially by software, hardware and their combination. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The server comprises a processor, a memory and a network interface which are connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The nonvolatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operating system and running code of the computer program in the non-volatile storage medium. The network interface of the computer device is used to communicate with other devices via a network connection. The computer program is executed by a processor to implement a similar URL string recognition method.
Those skilled in the art will appreciate that the architecture shown in fig. 7 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring a plurality of URL character strings meeting preset conditions;
binary coding is carried out on the designated fields in the URL character strings, a coding matrix is generated according to the coding result of the designated fields in the URL character strings, and each row of codes in the coding matrix corresponds to one URL character string;
aiming at the current row code in the coding matrix, searching all target row codes similar to the current row code in the coding matrix;
and determining the URL character string corresponding to the target row code as a URL character string similar to the URL character string corresponding to the current row code.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring a plurality of URL character strings meeting preset conditions;
binary coding is carried out on the designated fields in the URL character strings, a coding matrix is generated according to the coding result of the designated fields in the URL character strings, and each row of codes in the coding matrix corresponds to one URL character string;
aiming at the current row code in the coding matrix, searching all target row codes similar to the current row code in the coding matrix;
and determining the URL character string corresponding to the target row code as a URL character string similar to the URL character string corresponding to the current row code.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples express only several embodiments of the present application, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for identifying similar URL strings, the method comprising:
acquiring a plurality of URL character strings meeting preset conditions;
binary coding is carried out on the designated fields in the URL character strings, a coding matrix is generated according to the coding result of the designated fields in the URL character strings, and each row of codes in the coding matrix corresponds to one URL character string;
aiming at the current row code in the coding matrix, finding out all target row codes similar to the current row code in the coding matrix;
and determining the URL character string corresponding to the target row code as a URL character string similar to the URL character string corresponding to the current row code.
2. The method of claim 1, wherein prior to said binary encoding said specified field in each of said URL strings, said method further comprises:
and deleting the characters meeting preset deleting conditions in the specified fields aiming at the specified fields in each URL character string.
3. The method of claim 2, wherein the preset deletion condition comprises: deleting all numbers in the specified field and deleting content between special characters in the specified field.
4. The method of claim 1, wherein binary encoding the specified field in each of the URL strings comprises:
and aiming at the specified field in each URL character string, encoding the specified field by using an improved SimHash algorithm to obtain a binary code with a fixed length.
5. The method of any one of claims 1 to 4, wherein the finding all target row codes similar to the current row code in the coding matrix for the current row code in the coding matrix comprises:
aiming at the current row code in the coding matrix, performing logic operation on columns corresponding to all elements with preset values in the current row code to obtain a logic operation result;
and searching all target row codes similar to the current row code in the coding matrix according to the logical operation result.
6. The method of claim 5, wherein finding all target row codes similar to the current row code in the coding matrix according to the logical operation result comprises:
determining all candidate row codes in the coding matrix according to the elements with the element values being preset values in the logical operation result, and calculating the Hamming distance between the current row code and each candidate row code;
for each of the candidate row codes, determining the candidate row code as a target row code similar to the current row code when the Hamming distance between the current row code and the candidate row code does not exceed a preset threshold.
7. The method of claim 1, further comprising:
deleting, from the plurality of URL character strings, the URL character strings corresponding to all target row codes similar to the current row code.
8. An apparatus for identifying similar URL strings, the apparatus comprising:
the acquisition module is used for acquiring a plurality of URL character strings meeting preset conditions;
the encoding module is used for carrying out binary encoding on the designated fields in the URL character strings and generating an encoding matrix according to the encoding result of the designated fields in the URL character strings, wherein each row of codes in the encoding matrix corresponds to one URL character string;
the searching module is used for searching all target row codes similar to the current row codes in the coding matrix aiming at the current row codes in the coding matrix;
and the determining module is used for determining the URL character string corresponding to the target row code as a URL character string similar to the URL character string corresponding to the current row code.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any one of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202110494522.1A 2021-05-07 2021-05-07 Similar URL character string recognition method and device, computer equipment and storage medium Pending CN113282849A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110494522.1A CN113282849A (en) 2021-05-07 2021-05-07 Similar URL character string recognition method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113282849A true CN113282849A (en) 2021-08-20

Family

ID=77278282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110494522.1A Pending CN113282849A (en) 2021-05-07 2021-05-07 Similar URL character string recognition method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113282849A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7464804B1 (en) 2024-01-10 2024-04-09 株式会社ユービーセキュア Security Test System

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108959359A (en) * 2018-05-16 2018-12-07 顺丰科技有限公司 A kind of uniform resource locator semanteme De-weight method, device, equipment and medium
CN112395877A (en) * 2020-11-04 2021-02-23 苏宁云计算有限公司 Character string detection method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210820