CN106446068B - Directory database generation and query method and device - Google Patents

Publication number
CN106446068B
Authority
CN
China
Prior art keywords
directory
website
target website
source code
directory database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610806972.9A
Other languages
Chinese (zh)
Other versions
CN106446068A (en
Inventor
郭燕慧
孙博文
徐国爱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201610806972.9A priority Critical patent/CN106446068B/en
Publication of CN106446068A publication Critical patent/CN106446068A/en
Application granted granted Critical
Publication of CN106446068B publication Critical patent/CN106446068B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method and a device for generating and querying a directory database. The generation method comprises the following steps: acquiring the website architecture of a target website, and determining a crawler strategy according to the website architecture; acquiring source code data of the target website according to the crawler strategy; determining a primary key of the source code data in the directory database according to the path from which the source code data is acquired, and determining a secondary key of the source code data in the directory database according to the characteristic parameters of the source code data; and determining a storage directory of the source code data of the target website in the directory database according to the primary key and the secondary key, and storing the source code data to generate the directory database. In the invention, the generated directory database stores a large amount of data, and the storage path of the source code data of the target website in the directory database is determined according to the primary key and the secondary key, so that directory queries can be carried out conveniently.

Description

Directory database generation and query method and device
Technical Field
The invention relates to the technical field of computers and information security, in particular to a method and a device for generating and querying a directory database.
Background
Web sites are among the most widely used applications on the Internet and are therefore frequent targets of malicious attacks. Among these, the directory scanning attack is broadly applicable and highly harmful. In a directory scanning attack, an attacker issues Hypertext Transfer Protocol (HTTP) requests by iterating over large lists of directory and file names in order to obtain the topology of the website's directory information, thereby exposing sensitive information such as upload pages and back-end login pages. Once the attacker obtains this information, the security protection of the entire website system can be breached in a single step.
To prevent such attacks, the security of the website needs to be tested and the vulnerable directories in the website need to be found, so that the website architecture can be modified in time and the security of the website protected.
In the prior art, a directory scanning technique is mostly used to scan against a website directory database in order to find the vulnerable directories in a website. However, the website directory databases generated in the prior art record only information that may exist on some websites, store a small amount of data, and lack classified storage. In addition, the directory databases in the prior art are hidden. As a result, it is difficult to find the vulnerable directories of a website by scanning, which hinders security testing of the website.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for generating and querying a directory database, so as to solve the problems that website directory databases in the prior art store a small amount of data and lack classified storage, which makes it difficult to find the vulnerable directories of a website by scanning and hinders security testing of the website.
In a first aspect, an embodiment of the present invention provides a method for generating a directory database, where the method includes:
acquiring a website architecture of a target website, and determining a crawler strategy according to the website architecture;
acquiring source code data of the target website according to the crawler strategy;
determining a primary key of the source code data in a directory database according to a path of the obtained source code data, and determining a secondary key of the source code data in the directory database according to characteristic parameters of the source code data, wherein the characteristic parameters comprise: the type of the source code data, the internet protocol address, the scripting language, the file type, the directory depth and the directory breadth;
and determining a storage directory of the source code data of the target website in the directory database according to the primary key and the secondary key, storing the source code data, and generating the directory database.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, where the determining a crawler strategy according to the website architecture includes:
determining the directory depth of the target website according to the website architecture of the target website, and determining the directory depth as the crawling depth;
analyzing the website architecture of the target website, and constructing a universal uniform resource locator suitable for all source codes on the target website;
setting encrypted data and request header information of the target website according to the website architecture of the target website;
determining the crawling depth, the universal uniform resource locator, the encrypted data of the target website and the request header information as the crawler strategy.
With reference to the first aspect, an embodiment of the present invention provides a second possible implementation manner of the first aspect, where the obtaining source code data of the target website according to the crawler strategy includes:
analyzing a source code compression packet from the webpage data stream of the target website in a multithreading mode;
and processing the source code compressed packet to obtain the source code data of the target website.
With reference to the second possible implementation manner of the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, where the processing the source code compressed packet to obtain the source code data of the target website includes:
decompressing the source code compressed packet to obtain the original data of the source code;
screening the original data, and screening out pictures, scripts and invalid link data in the original data;
and denoising and screening the screened original data, and screening out repeated original data to obtain the source code data of the target website.
In a second aspect, an embodiment of the present invention provides a directory database query method, which is applied to a directory database generated by the above directory database generation method, where the method includes:
setting a crawler strategy of a target website, and acquiring a website topological graph of the target website according to the crawler strategy, wherein the crawler strategy comprises setting a crawler starting uniform resource locator, target website encrypted data and requested header information;
and according to the website topological graph of the target website and a pre-generated directory database, inquiring a directory database subset matched with the target website from the directory database by adopting a pattern matching method, wherein the pattern matching comprises website structural information matching and directory tree structure matching.
With reference to the second aspect, an embodiment of the present invention provides a first possible implementation manner of the second aspect, where the pattern matching includes matching of website structural information, and the screening, according to the website topology map of the target website and a pre-generated catalog database, a catalog database subset matching the target website from the catalog database includes:
acquiring structural information of the target website from the website topological graph, wherein the structural information comprises one or more of a scripting language of the target website, a directory depth of the target website and a directory breadth of the target website;
performing state conversion on the directory database subsets in the directory database by adopting a permutation function according to the structural information of the target website, and screening out the directory database subsets which are not matched with the structural information;
and determining the rest directory database subset in the directory database as the directory database subset matched with the target website.
With reference to the second aspect, an embodiment of the present invention provides a second possible implementation manner of the second aspect, where the pattern matching includes directory tree structure matching, and the querying, according to the website topology map of the target website and a pre-generated directory database, a directory database subset matched with the target website from the directory database includes:
converting a subset of a directory database in the directory database into a directory tree structure;
selecting a directory tree structure with the similarity to the target website greater than or equal to a preset value from the directory tree structures in the directory database by adopting a pattern matching algorithm;
and determining the directory database subset corresponding to the selected directory tree structure as the directory database subset matched with the target website.
In a third aspect, an embodiment of the present invention provides a device for generating a directory database, where the device includes:
the first determining module is used for acquiring a website architecture of a target website and determining a crawler strategy according to the website architecture;
the acquisition module is used for acquiring the source code data of the target website according to the crawler strategy;
a second determining module, configured to determine, according to a path where the source code data is obtained, a primary key of the source code data in a directory database, and determine, according to a feature parameter of the source code data, a secondary key of the source code data in the directory database, where the feature parameter includes: the type of the source code data, the internet protocol address, the scripting language, the file type, the directory depth and the directory breadth;
and the generating module is used for determining a storage directory of the source code data of the target website in the directory database according to the primary key and the secondary key, storing the source code data and generating the directory database.
With reference to the third aspect, an embodiment of the present invention provides a first possible implementation manner of the third aspect, where the obtaining module includes:
the analysis unit is used for analyzing the source code compressed packet from the webpage data stream of the target website in a multithreading mode;
and the acquisition unit is used for processing the source code compressed packet to acquire the source code data of the target website.
In a fourth aspect, an embodiment of the present invention provides a catalog database query apparatus, which is applied to the catalog database generated by the catalog database generation apparatus, where the apparatus includes:
the acquisition module is used for setting a crawler strategy of a target website and acquiring a website topological graph of the target website according to the crawler strategy, wherein the crawler strategy comprises a crawler starting uniform resource locator, encrypted data of the target website and request header information;
and the query module is used for querying a directory database subset matched with the target website from the directory database by adopting a pattern matching method according to the website topological graph of the target website and a pre-generated directory database, wherein the pattern matching comprises website structural information matching and directory tree structure matching.
According to the method and the device for generating and querying the directory database provided by the embodiments of the present invention, the generated directory database stores a large amount of data, and the storage path of the source code data of the target website in the directory database is determined according to the primary key and the secondary key, so that directory queries can be carried out conveniently. Vulnerable directories are therefore easier to find during directory scanning, so that the website architecture can be adjusted in time and the security of the website protected.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a flowchart illustrating a directory database generation method provided in embodiment 1 of the present invention;
Fig. 2 is a flowchart illustrating a directory database query method provided in embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram illustrating a directory database generation apparatus provided in embodiment 3 of the present invention;
Fig. 4 is a schematic structural diagram illustrating a directory database query apparatus provided in embodiment 4 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
In the prior art, to protect a website against directory scanning attacks, the directory database of the website needs to be scanned to find the vulnerable directories of the website, so that the website architecture can be modified in time and the security of the website protected. A directory scanning technique is mostly used for this purpose. However, existing directory databases record only information that may exist on some websites, store a small amount of data, and are hidden, so that the vulnerable directories of a website are difficult to find by scanning, which hinders security testing of the website. Based on this, embodiments of the present invention provide a method and an apparatus for generating and querying a directory database, which are described below by way of embodiments.
Example 1
The embodiment of the invention provides a method for generating a directory database. The directory database generated by this method holds a larger amount of data, and a primary key and a secondary key are set for the website source code data in the directory database, which facilitates directory queries.
As shown in fig. 1, the method for generating a directory database according to the embodiment of the present invention includes steps S110 to S140, which are described as follows.
S110, acquiring a website architecture of the target website, and determining a crawler strategy according to the website architecture.
The target website is an open source website building platform; for example, the target website may be GitHub, sitter, or the like, or may be another website. The embodiment of the present invention does not limit the specific type of the target website.
Before the crawler strategy of the target website is determined, the website architecture of the target website needs to be acquired. It can be acquired with web crawler technology, by traversing and crawling each node of the website. After the website architecture of the target website is acquired, the crawler strategy is determined according to it, which specifically comprises the following steps: determining the directory depth of the target website according to the website architecture of the target website, and determining the directory depth as the crawling depth; analyzing the website architecture of the target website, and constructing a universal Uniform Resource Locator (URL) suitable for all source codes on the target website; setting encrypted data and request header information of the target website according to the website architecture of the target website; and determining the crawling depth, the universal URL, the encrypted data of the target website and the request header information as the crawler strategy.
In the embodiment of the invention, an operator may analyze the architecture of the target website to determine the download page of the source code data packets, that is, the page from which the source code data packets can be downloaded, as well as the directory depth of the target website. The directory depth of the target website is taken as the depth to be crawled.
Because many source codes can be downloaded from each website, a universal URL suitable for all source codes of the target website can be constructed for convenience. To download a different source code, only the source code name in the universal URL needs to be changed, and there is no need to set a separate URL for each source code. For example, with a URL of the form www.xxx.com/? followed by the source code name, different source codes can be downloaded simply by changing the trailing source code name, and the part of the universal URL in front of it does not need to be modified.
In addition, some websites have an anti-crawler mechanism and do not allow their source code data packets to be downloaded on a large scale. To obtain the source code compressed packets of such a website, the encrypted data and the request header information of the target website may be set; in this way the anti-crawler mechanism can be avoided and the source code compressed packets of the target website can be downloaded on a large scale. The encrypted data of the target website may be a cookie of the target website, and the request header information may be the Referer header of the request.
The determined crawling depth, universal URL, cookie and Referer header information are then taken as the crawler strategy of the target website.
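The four elements of the crawler strategy can be pictured as one small record. The following is a minimal Python sketch under stated assumptions: the class name `CrawlerStrategy`, its field names, the `name` query parameter and the example.com URLs are all hypothetical and do not come from the embodiment.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class CrawlerStrategy:
    """Hypothetical container for the four strategy elements named in the text."""
    crawl_depth: int   # directory depth of the target website
    url_template: str  # universal URL with a slot for the source code name
    cookie: str        # the "encrypted data" of the target website
    referer: str       # Referer header sent with each request

    def download_url(self, source_name: str) -> str:
        # Only the source code name changes between downloads.
        return self.url_template.format(name=quote(source_name, safe=""))

    def headers(self) -> dict:
        # Request headers intended to get past an anti-crawler mechanism.
        return {"Cookie": self.cookie, "Referer": self.referer}


strategy = CrawlerStrategy(
    crawl_depth=3,
    url_template="https://www.example.com/download?name={name}",  # assumed layout
    cookie="session=abc123",                                      # placeholder value
    referer="https://www.example.com/",
)
print(strategy.download_url("blog engine"))
print(strategy.headers())
```

Keeping the part of the URL in front of the source code name fixed in a template mirrors the universal URL idea: a different source code is requested by changing only the name passed to `download_url`.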
And S120, acquiring the source code data of the target website according to the crawler strategy.
After the crawler strategy of the target website is determined, the source code data of the target website is obtained according to the determined crawler strategy of the target website, and the specific process comprises the following steps: analyzing a source code compression packet from a webpage data stream of a target website in a multithreading mode; and processing the source code compressed packet to obtain the source code data of the target website.
In the embodiment of the present invention, in order to increase the download rate of the source code compressed packets, the packets may be parsed and downloaded from the web page data stream in a multithreaded manner, that is, multiple threads run concurrently, and each source code compressed packet is stored after being parsed from the web page data stream.
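As an illustration of multithreaded downloading, the sketch below uses a thread pool from the Python standard library. It is only one possible arrangement: the function names `download_all` and `fake_fetch` are hypothetical, and the network request is replaced by a stand-in so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def download_all(urls: Iterable[str],
                 fetch: Callable[[str], bytes],
                 max_workers: int = 8) -> Dict[str, bytes]:
    """Fetch every URL concurrently and return {url: archive bytes}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        return dict(zip(urls, pool.map(fetch, urls)))


def fake_fetch(url: str) -> bytes:
    # Stand-in for a real HTTP request so the sketch runs offline.
    return ("archive for " + url).encode()


archives = download_all(["a.zip", "b.zip", "c.zip"], fake_fetch, max_workers=3)
print(sorted(archives))
```

In a real crawler, `fetch` would issue the HTTP request with the cookie and Referer header of the crawler strategy and return the body of the compressed packet.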
And the obtaining of the source code data of the target website from the source code compressed packet specifically includes: decompressing the source code compressed packet to obtain the original data of the source code; screening the original data, and screening out pictures, scripts and invalid link data in the original data; and denoising and screening the screened original data, and screening out repeated original data to obtain source code data of the target website.
In the embodiment of the present invention, the source code compressed packets are downloaded from the web page data stream, so they must first be decompressed. During decompression, the source code compressed packets may be classified according to their format and then decompressed by category, yielding a large amount of original data, which includes source code data, invalid links, pictures, scripts and the like.
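The decompression and screening steps can be sketched as follows. This is a simplified illustration, not the embodiment's implementation: the list of skipped file extensions is an assumption (the text does not enumerate which files count as pictures or scripts), repeated data is detected here by comparing content hashes, and invalid-link screening is left out.

```python
import hashlib
import io
import tarfile
import zipfile
from pathlib import PurePosixPath

# Assumed extension list; the source does not enumerate it.
SKIPPED_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".ico", ".js"}


def read_members(archive: bytes) -> dict:
    """Return {member path: content}, choosing the reader by archive format."""
    buffer = io.BytesIO(archive)
    if zipfile.is_zipfile(buffer):
        buffer.seek(0)
        with zipfile.ZipFile(buffer) as zf:
            return {n: zf.read(n) for n in zf.namelist() if not n.endswith("/")}
    buffer.seek(0)
    with tarfile.open(fileobj=buffer) as tf:  # raises tarfile.ReadError otherwise
        return {m.name: tf.extractfile(m).read() for m in tf.getmembers() if m.isfile()}


def clean(members: dict) -> dict:
    """Drop pictures and scripts, then drop files whose content repeats."""
    seen, kept = set(), {}
    for path in sorted(members):
        if PurePosixPath(path).suffix.lower() in SKIPPED_SUFFIXES:
            continue
        digest = hashlib.sha256(members[path]).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept[path] = members[path]
    return kept


# Build a small in-memory zip to exercise the sketch.
raw = io.BytesIO()
with zipfile.ZipFile(raw, "w") as zf:
    zf.writestr("admin/login.php", "<?php // login ?>")
    zf.writestr("backup/login.php", "<?php // login ?>")  # duplicate content
    zf.writestr("static/logo.png", "not really a png")
    zf.writestr("static/app.js", "console.log(1)")
    zf.writestr("index.php", "<?php // home ?>")

source_files = clean(read_members(raw.getvalue()))
print(sorted(source_files))  # ['admin/login.php', 'index.php']
```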
S130, determining a primary key of the source code data in the directory database according to the path of the acquired source code data, and determining a secondary key of the source code data in the directory database according to characteristic parameters of the source code data, wherein the characteristic parameters comprise: type of source code data, Internet Protocol (IP) address, scripting language, file type, directory depth, and directory breadth.
The path for acquiring the source code data refers to a path for analyzing the source code data from the source code compressed packet, and the path for acquiring the source code data is used as a main key, so that repeated paths cannot appear in the directory database.
The characteristic parameters may further include the number of occurrences of the source code data.
S140, determining a storage directory of the source code data of the target website in the directory database according to the primary key and the secondary key, storing the source code data, and generating the directory database.
And after the primary key and the secondary key of the source code data in the directory database are determined, determining a storage directory of the source code data in the directory database according to the determined primary key and the determined secondary key, and storing the source code data of the target website according to the determined storage directory, thereby generating the directory database of the website.
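One way to picture the primary key and the secondary key is a relational table in which the path is the unique key and the characteristic parameters are indexed columns. The sketch below uses SQLite purely for illustration; the embodiment does not name a storage engine, and the table name, column names and sample rows are assumptions. Only some of the characteristic parameters are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE directory_entry (
        path            TEXT PRIMARY KEY,  -- primary key: path of the source code data
        site            TEXT,              -- directory database subset it belongs to
        script_language TEXT,
        file_type       TEXT,
        dir_depth       INTEGER,
        dir_breadth     INTEGER
    )
""")
# Index on characteristic parameters, standing in for the secondary key.
conn.execute(
    "CREATE INDEX idx_features ON directory_entry (script_language, dir_depth, dir_breadth)"
)

rows = [
    ("siteA/admin/login.php", "siteA", "php", "php", 2, 3),
    ("siteA/index.php", "siteA", "php", "php", 2, 3),
    ("siteB/manage/login.jsp", "siteB", "jsp", "jsp", 4, 5),
    ("siteA/index.php", "siteA", "php", "php", 2, 3),  # repeated path
]
# INSERT OR IGNORE drops the repeated path, since the primary key is unique.
conn.executemany("INSERT OR IGNORE INTO directory_entry VALUES (?, ?, ?, ?, ?, ?)", rows)

php_paths = [r[0] for r in conn.execute(
    "SELECT path FROM directory_entry WHERE script_language = ? ORDER BY path", ("php",))]
print(php_paths)  # ['siteA/admin/login.php', 'siteA/index.php']
```

Because the path is the primary key, a repeated path cannot be stored twice, and queries by scripting language, directory depth or directory breadth can use the index instead of scanning every entry.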
The generated catalog database can store the source code data of a plurality of websites, the source code data of each website can be stored in a catalog database subset, or the source code data of the same type of websites is stored in a catalog database subset, and the catalog database is formed by a plurality of catalog database subsets.
Websites of the same type are websites that use the same scripting language or that have the same or a similar website architecture. The embodiment of the present invention does not limit which parameters must be the same or similar for websites to be regarded as the same type.
Because the storage directory of the source code data of the target website in the directory database is determined according to the primary key and the secondary key, the generated directory database allows a user to query directory database subsets according to the structural information of a website. For example, the directory database subset of a homologous website can be queried, the directory database subset of a website using the same scripting language can be queried, and the directory database subset of a similar website can be queried according to the directory depth and the directory breadth of the website.
According to the directory database generation method provided by the embodiment of the invention, the generated directory database stores a large amount of data, and the storage path of the source code data of the target website in the directory database is determined according to the primary key and the secondary key, so that directory queries can be carried out conveniently. Vulnerable directories are therefore easier to find during directory scanning, so that the website architecture can be adjusted in time and the security of the website protected.
Example 2
The embodiment of the invention provides a directory database query method, which is applied to a directory database generated by the directory database generation method in the embodiment 1 of the invention.
As shown in fig. 2, when the directory database query method provided by the embodiment of the present invention is used to query the directory database subset matched with the target website, the method specifically includes steps S210 to S220.
S210, a crawler strategy of the target website is set, a website topological graph (Sitemap) of the target website is obtained according to the crawler strategy, and the crawler strategy comprises a set crawler starting URL, encrypted data of the target website and requested header information.
When web crawler technology is used to obtain the Sitemap of the target website, a crawling start URL needs to be set. Some websites are provided with an anti-crawler mechanism; to avoid it, the encrypted data of the target website and the request header information also need to be set. The encrypted data of the target website may be a cookie of the target website, and the request header information may be a header of the request. The crawler start URL, the encrypted data of the target website and the request header information are taken as the crawler strategy of the target website, and the target website is crawled according to this crawler strategy.
When the target website is crawled, each node of the target website is traversed using a Breadth-First Search (BFS) algorithm and a Depth-First Search (DFS) algorithm, so that the crawled data of the target website is comprehensive. Beautiful Soup is used to extract data from each web page of the target website, and crawled URLs under the same domain name are continuously added to a queue. Crawling stops when an exit condition is met; the exit condition may be that every node of the target website has been traversed. The crawled data is then stored. It includes the Sitemap of the target website, which in turn contains the structural information of the target website: the language of the target website, and its directory depth and directory breadth.
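The queue-based traversal can be sketched as below. To keep the example self-contained it makes several substitutions that are not in the embodiment: the standard-library `html.parser` is used in place of Beautiful Soup, only the breadth-first traversal is shown, pages come from an in-memory dictionary instead of HTTP responses, and `directory_depth` uses an assumed definition of directory depth.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href values of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def crawl(start_url, fetch):
    """Breadth-first traversal of same-domain pages; returns the visited URLs."""
    domain = urlparse(start_url).netloc
    queue, seen = deque([start_url]), {start_url}
    while queue:  # exit condition: every reachable node has been visited
        url = queue.popleft()
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            target = urljoin(url, href)
            if urlparse(target).netloc == domain and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def directory_depth(urls):
    """Largest number of directory levels above any page (assumed definition)."""
    return max(len([p for p in urlparse(u).path.split("/")[:-1] if p]) for u in urls)


PAGES = {  # tiny in-memory site standing in for real HTTP responses
    "http://site.test/": '<a href="/admin/">a</a><a href="/blog/post.html">b</a>',
    "http://site.test/admin/": '<a href="login.php">l</a><a href="http://other.test/">x</a>',
    "http://site.test/admin/login.php": "",
    "http://site.test/blog/post.html": '<a href="/">home</a>',
}

sitemap = crawl("http://site.test/", lambda u: PAGES.get(u, ""))
print(sorted(sitemap), directory_depth(sitemap))
```

Links that point to another domain are never queued, which corresponds to adding only crawled URLs under the same domain name to the queue.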
And S220, according to the Sitemap of the target website and a pre-generated directory database, inquiring a directory database subset matched with the target website from the directory database by adopting a pattern matching method, wherein the pattern matching comprises website structural information matching and directory tree structure matching.
When the pattern matching is structural information matching, searching a directory database subset matched with the target website from a directory database, wherein the searching comprises the following steps: acquiring structural information of a target website from the Sitemap, wherein the structural information comprises one or more of a scripting language of the target website, a directory depth of the target website and a directory breadth of the target website; performing state conversion on the directory database subsets in the directory database by adopting a permutation function according to the structural information of the target website, and screening out the directory database subsets which are not matched with the structural information; and determining the rest directory database subset in the directory database as the directory database subset matched with the target website.
In the embodiment of the invention, a pattern matching algorithm based on a finite state automaton is used to obtain the directory database subset matched with the target website. A deterministic finite state automaton is used, represented by the five-tuple $M = (K, \Sigma, f, S, Z)$, where: $K$ is a finite set containing all states of the automaton, which in the embodiment of the present invention are all the directory database subsets in the directory database; $\Sigma$ is a finite alphabet in which each element is an input character, that is, one item of structural information of the target website; $f$ is the permutation (state transition) function, a mapping from $K \times \Sigma$ to $K$, so that each time an item of structural information is input, the directory database subsets in the directory database are filtered once; $S \in K$ is the unique initial state of the automaton; and $Z$ is the set of final states, a proper subset of $K$, whose elements denote the end of a pattern.
When an item of structural information of the target website is input into the permutation function, the current directory database subset is converted into another directory database subset according to the formula $f(k_i, a) = k_j$ ($k_i \in K$, $k_j \in K$), where $k_i$ is the current directory database subset and $a$ is the input structural information of the target website, for example the directory depth or the directory breadth of the target website. Through the permutation function, the current directory database subset $k_i$ is converted into another directory database subset $k_j$ that conforms to the structural information of the target website. In this way, the directory database subsets that do not match the structural information of the target website are screened out, and the remaining directory database subset in the directory database is determined to be the directory database subset matched with the target website.
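The filtering behaviour of the permutation function can be illustrated with a short sketch. It assumes that each state is a set of candidate directory database subsets and that an input symbol is a (field, value) pair compared by exact equality; the subset names, the metadata fields and the functions `transition` and `match_structure` are all hypothetical.

```python
from functools import reduce

# Hypothetical metadata describing each directory database subset.
SUBSETS = {
    "cms_a": {"language": "php", "depth": 3, "breadth": 6},
    "cms_b": {"language": "php", "depth": 5, "breadth": 6},
    "shop_c": {"language": "jsp", "depth": 3, "breadth": 4},
}


def transition(state, symbol):
    """f(k_i, a) = k_j: keep only the subsets consistent with one input symbol."""
    field, value = symbol
    return frozenset(name for name in state if SUBSETS[name][field] == value)


def match_structure(structured_info):
    """Feed each item of structured information through the transition function."""
    start = frozenset(SUBSETS)  # initial state S: every subset is a candidate
    return reduce(transition, structured_info, start)


result = match_structure([("language", "php"), ("depth", 3)])
```

Each input item narrows the candidate set once, so after the last item the remaining state is the set of subsets that match all of the supplied structural information.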
When the pattern matching comprises directory tree structure matching, querying the directory database subset matched with the target website from the directory database comprises: converting the directory database subsets in the directory database into directory tree structures; selecting, by a pattern matching algorithm, a directory tree structure whose similarity to the Sitemap of the target website is greater than or equal to a preset value from the directory tree structures in the directory database; and determining the directory database subset corresponding to the selected directory tree structure as the directory database subset matched with the target website.
The nodes of the directory tree structure may be matched against the Sitemap of the target website using the Knuth-Morris-Pratt (KMP) matching algorithm.
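The text does not say how the trees are serialised for KMP matching. The sketch below assumes that both the Sitemap and the directory tree have been flattened into sequences of directory names (for example by a pre-order walk) and applies a standard KMP search to those sequences; `build_failure`, `kmp_find` and the sample node lists are illustrative names only.

```python
def build_failure(pattern):
    """Failure table: length of the longest proper prefix that is also a suffix."""
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_find(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    failure = build_failure(pattern)
    k = 0
    for i, item in enumerate(text):
        while k > 0 and item != pattern[k]:
            k = failure[k - 1]
        if item == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


# Hypothetical pre-order serialisations of directory names.
sitemap_nodes = ["root", "admin", "login", "upload", "static", "css"]
tree_nodes = ["admin", "login", "upload"]
position = kmp_find(sitemap_nodes, tree_nodes)
```

Because the failure table avoids re-examining matched elements, the search runs in time linear in the combined length of the two sequences.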
When the similarity between the directory tree structure and the Sitemap of the target website is greater than or equal to the preset value, the directory tree structure is judged to be the same as or similar to the target website, and the directory database subset corresponding to the directory tree structure is determined as the directory database subset matched with the target website.
Whether the two are the same or similar can be judged by comparing the nodes of the Sitemap of the target website with the nodes of the directory tree structure: if the proportion of identical nodes to total nodes reaches the preset value, the target website and the directory tree structure are judged to be the same or similar.
For example, suppose the directory tree structure being compared with the target website is a three-level directory: the first level has one node, the second level has two nodes, and each second-level node has three nodes on the third level. If the Sitemap of the target website also has three levels, with one node on the first level, two nodes on the second level, and three third-level nodes under each second-level node, the directory tree structure can be determined to be the same as the Sitemap of the target website. This is only an example to illustrate the comparison process. If the nodes of the directory tree structure and the Sitemap of the target website are not completely identical but differ only slightly, the directory tree structure may still be determined to be the same as or similar to the target website. The specific condition for being determined the same or similar can be set according to the actual situation and is not limited by the embodiments of the present invention.
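The node-proportion comparison can be sketched as below. The text does not define "total nodes" precisely, so this sketch assumes it means the union of the nodes of both trees (an intersection-over-union ratio); the nested-dict tree representation, the function names and the default preset value of 0.8 are also assumptions.

```python
def tree_paths(tree, prefix=""):
    """Flatten a nested-dict directory tree into a set of node paths."""
    paths = set()
    for name, children in tree.items():
        path = f"{prefix}/{name}"
        paths.add(path)
        paths |= tree_paths(children, path)
    return paths


def similarity(sitemap_tree, candidate_tree):
    """Share of identical nodes among all nodes (assumed: intersection over union)."""
    a, b = tree_paths(sitemap_tree), tree_paths(candidate_tree)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_match(sitemap_tree, candidate_tree, preset_value=0.8):
    """True when the similarity reaches the preset value."""
    return similarity(sitemap_tree, candidate_tree) >= preset_value


# Three-level trees as in the example: 1 node, 2 nodes, 3 nodes under each.
sitemap = {"root": {"admin": {"login": {}, "users": {}, "logs": {}},
                    "static": {"css": {}, "js": {}, "img": {}}}}
candidate = {"root": {"admin": {"login": {}, "users": {}, "logs": {}},
                      "static": {"css": {}, "js": {}, "fonts": {}}}}
score = similarity(sitemap, candidate)
```

Here the two trees differ in a single third-level node, so eight of the ten distinct nodes are shared and the candidate still reaches a preset value of 0.8.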
A directory tree structure can be established for the directory database subset under the same source code in the directory database; that is, one directory tree structure can be established for the directory database subset corresponding to one website.
The queried directory database subset matched with the target website is determined as the directory dictionary of the target website. The directory dictionary is scanned using a directory scanning technique to search for vulnerable directories, and when a vulnerable directory is found, the website architecture of the target website is modified in time.
The directory database subset matched with the target website may be queried using only the structured information matching method, or using only the directory tree structure matching method. Alternatively, the directory database subsets may first be screened by structured information matching and the result further screened by directory tree structure matching, or first screened by directory tree structure matching and the result further screened by structured information matching, to obtain the directory database subset matched with the target website.
If no vulnerable directory is found in the directory dictionary, the query condition may be relaxed appropriately, for example by lowering the preset value or by reducing the number of items of structured information used to screen the directory database subsets. The directory dictionary obtained under the relaxed condition is then scanned.
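One way to picture the relaxation step is a loop that lowers the preset value until the scan reports something. This is only a sketch: the step size, the lower bound, the `scan` callable and the similarity scores are all hypothetical, and the text does not prescribe any particular relaxation schedule.

```python
def find_with_relaxation(candidates, scan, preset_value=0.9, step=0.1, floor=0.5):
    """Lower the preset similarity value until the scan finds something.

    candidates maps a subset name to its similarity score; scan(names) is a
    hypothetical directory scanner returning the vulnerable directories found.
    """
    threshold = preset_value
    while threshold >= floor - 1e-9:  # small tolerance for float rounding
        dictionary = sorted(
            name for name, score in candidates.items() if score >= threshold - 1e-9
        )
        found = scan(dictionary)
        if found:
            return round(threshold, 2), found
        threshold -= step  # relax the query condition and try again
    return None, []


# Usage with made-up scores and a fake scanner.
candidates = {"cms_a": 0.95, "cms_b": 0.72}
fake_scan = lambda names: ["/backup/"] if "cms_b" in names else []
used_threshold, vulnerable = find_with_relaxation(candidates, fake_scan)
```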
According to the directory database query method provided by the embodiment of the invention, the directory database subset matched with the target website is queried from the directory database through the pattern matching method, so that the query efficiency is improved.
Example 3
The embodiment of the invention provides a device for generating a directory database, which is used for executing the method for generating the directory database provided by the embodiment 1 of the invention.
As shown in fig. 3, the apparatus for generating a directory database according to the embodiment of the present invention includes a first determining module 310, an obtaining module 320, a second determining module 330, and a generating module 340;
the first determining module 310 is configured to obtain a website framework of a target website, and determine a crawler policy according to the website framework;
the obtaining module 320 is configured to obtain source code data of a target website according to the crawler policy;
the second determining module 330 is configured to determine a primary key of the source code data in the directory database according to the path of the source code data, and determine a secondary key of the source code data in the directory database according to a characteristic parameter of the source code data, where the characteristic parameter includes: the type, IP address, scripting language, file type, directory depth and directory breadth of the source code data;
the generating module 340 is configured to determine a storage directory of the source code data of the target website in the directory database according to the primary key and the secondary key, store the source code data, and generate the directory database.
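How the two keys might combine into a storage directory is sketched below. The text only states that the primary key comes from the path and the secondary key from the characteristic parameters, so the concrete encoding (a normalised path plus an underscore-joined parameter string), the field names and the sample values are assumptions made for illustration.

```python
from pathlib import PurePosixPath


def primary_key(source_path):
    """Primary key derived from the path of the source code data."""
    return PurePosixPath(source_path).as_posix().strip("/")


def secondary_key(features):
    """Secondary key built from the characteristic parameters, in a fixed order."""
    order = ("type", "ip", "language", "file_type", "depth", "breadth")
    return "_".join(str(features[name]) for name in order)


def storage_directory(source_path, features):
    """Storage directory determined by the primary key and the secondary key."""
    return f"{primary_key(source_path)}/{secondary_key(features)}"


# Hypothetical characteristic parameters of one piece of source code data.
features = {
    "type": "cms",
    "ip": "192.0.2.10",
    "language": "php",
    "file_type": "php",
    "depth": 3,
    "breadth": 6,
}
directory = storage_directory("/projects/demo-cms/", features)
```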
The target website is an open source website building platform. For example, the target website may be GitHub, sitter, or the like, or may be another website; the embodiment of the present invention does not limit the specific type of the target website.
The obtaining module 320 obtains the source code data of the target website according to the crawler policy by means of an analysis unit and an acquisition unit, specifically:
the analysis unit is used for analyzing the source code compressed packet from the webpage data stream of the target website in a multithreading mode; the acquisition unit is used for processing the source code compressed packet and acquiring the source code data of the target website.
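The processing performed by the acquisition unit can be sketched as follows, assuming the source code compressed packet is a ZIP archive. The suffix list used to screen out pictures and scripts and the content-hash deduplication are assumptions, and the multithreaded parsing of the webpage data stream is not shown.

```python
import hashlib
import io
import zipfile

# Assumed suffixes for pictures and scripts that are screened out.
SKIPPED_SUFFIXES = (".png", ".jpg", ".gif", ".js")


def process_archive(archive_bytes):
    """Decompress a ZIP archive, screen out unwanted files, drop duplicates."""
    seen_digests = set()
    source_files = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir() or info.filename.lower().endswith(SKIPPED_SUFFIXES):
                continue
            data = archive.read(info)
            digest = hashlib.sha256(data).hexdigest()
            if digest in seen_digests:  # identical content already stored
                continue
            seen_digests.add(digest)
            source_files[info.filename] = data
    return source_files


# Build a small archive in memory to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("index.php", "<?php echo 1; ?>")
    demo.writestr("copy/index.php", "<?php echo 1; ?>")
    demo.writestr("logo.png", "not really an image")
    demo.writestr("admin/login.php", "<?php echo 2; ?>")
kept = process_archive(buffer.getvalue())
```

Several archives could be handled in parallel by mapping `process_archive` over them with `concurrent.futures.ThreadPoolExecutor`, which is one possible reading of the multithreaded mode mentioned above.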
According to the directory database generation device provided by the embodiment of the invention, a large amount of data information is stored in the generated directory database, and the storage path of the source code data of the target website in the directory database is determined according to the primary key and the secondary key, which makes directory queries convenient. Vulnerable directories can therefore be found more easily during directory scanning, so that the website architecture can be adjusted in time and the security of the website protected.
Example 4
The embodiment of the present invention provides a directory database query device, which is applied to a directory database generated by a directory database generation device in embodiment 3 of the present invention, and is used for executing a directory database query method provided in embodiment 2 of the present invention.
As shown in fig. 4, the directory database query apparatus provided in the embodiment of the present invention includes an obtaining module 410 and a querying module 420;
the obtaining module 410 is configured to set a crawler policy of the target website, and obtain a Sitemap of the target website according to the crawler policy, where the crawler policy includes setting a crawler start URL, encrypted data of the target website, and header information of the request;
the query module 420 is configured to query, according to the Sitemap of the target website and the pre-generated directory database, a directory database subset matched with the target website from the directory database by using a pattern matching method, where the pattern matching includes website structural information matching and directory tree structure matching.
When web crawler technology is used to obtain the Sitemap of the target website, a crawl start URL needs to be set. Because some websites are equipped with an anti-crawler mechanism, the encrypted data of the target website and the header information of the request also need to be set in order to get past it. The encrypted data of the target website may be a cookie of the target website, and the header information of the request may be the request headers. The set crawl start URL, the encrypted data of the target website and the header information of the request are determined as the crawler policy of the target website, and the target website is crawled according to this crawler policy.
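The three parts of the crawler policy can be grouped as in the sketch below, which uses the standard-library `urllib.request.Request` to attach the cookie and headers. The `CrawlerPolicy` class, its field names and the placeholder values are hypothetical; no request is actually sent.

```python
from dataclasses import dataclass, field
from urllib.request import Request


@dataclass
class CrawlerPolicy:
    """Crawl start URL, cookie (encrypted data) and request header information."""

    start_url: str
    cookie: str = ""
    headers: dict = field(default_factory=dict)

    def build_request(self, url=None):
        """Create a request carrying the policy's headers and cookie."""
        request = Request(url or self.start_url, headers=dict(self.headers))
        if self.cookie:
            request.add_header("Cookie", self.cookie)
        return request


# Placeholder values; a real policy would use the target website's own data.
policy = CrawlerPolicy(
    start_url="http://example.test/",
    cookie="sessionid=placeholder",
    headers={"User-Agent": "Mozilla/5.0 (sketch)"},
)
request = policy.build_request()
```

Passing the resulting object to `urllib.request.urlopen` would send it; keeping construction separate from sending makes the policy easy to reuse for every URL taken from the crawl queue.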
According to the directory database query device provided by the embodiment of the invention, the directory database subset matched with the target website is queried from the directory database through the pattern matching method, so that the query efficiency is improved.
The directory database generation device and the directory database query device provided by the embodiment of the invention may be specific hardware on a piece of equipment, or software or firmware installed on the equipment. The devices provided by the embodiment of the present invention have the same implementation principle and technical effect as the method embodiments; for brevity, where the device embodiments do not mention something, reference may be made to the corresponding contents in the method embodiments. It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the foregoing systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments provided by the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus once an item is defined in one figure, it need not be further defined and explained in subsequent figures, and moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not depart from the spirit and scope of the present invention and are intended to be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (9)

1. A method for generating a directory database, the method comprising:
acquiring a website framework of a target website, and determining a crawler strategy according to the website framework;
acquiring source code data of the target website according to the crawler strategy;
determining a primary key of the source code data in a directory database according to a path of the obtained source code data, and determining a secondary key of the source code data in the directory database according to characteristic parameters of the source code data, wherein the characteristic parameters comprise: the type of the source code data, the internet protocol address, the scripting language, the file type, the directory depth and the directory breadth;
and determining a storage directory of the source code data of the target website in the directory database according to the primary key and the secondary key, storing the source code data, and generating the directory database.
2. The method of claim 1, wherein determining a crawler policy from the website framework comprises:
determining the directory depth of the target website according to the website architecture of the target website, and determining the directory depth as the crawling depth;
analyzing the website architecture of the target website, and constructing universal uniform resource locators suitable for all source codes on the target website;
setting encrypted data and requested header information of the target website according to the website architecture of the target website;
determining the crawling depth, the universal uniform resource locator, the encrypted data of the target website and the header information of the request as the crawler policy.
3. The method of claim 1, wherein the obtaining source code data of the target website according to the crawler policy comprises:
analyzing a source code compression packet from the webpage data stream of the target website in a multithreading mode;
and processing the source code compressed packet to obtain the source code data of the target website.
4. The method of claim 3, wherein the processing the source code compressed packet to obtain the source code data of the target website comprises:
decompressing the source code compressed packet to obtain the original data of the source code;
screening the original data, and screening out pictures, scripts and invalid link data in the original data;
and denoising and screening the screened original data, and screening out repeated original data to obtain the source code data of the target website.
5. A directory database query method applied to a directory database generated by the directory database generation method according to any one of claims 1 to 4, the method comprising:
setting a crawler strategy of a target website, and acquiring a website topological graph of the target website according to the crawler strategy, wherein the crawler strategy comprises setting a crawler starting uniform resource locator, target website encrypted data and requested header information;
according to the website topological graph of the target website and a pre-generated directory database, a pattern matching method is adopted to query a directory database subset matched with the target website from the directory database, and the pattern matching comprises website structural information matching and directory tree structure matching;
when the pattern matching comprises matching of website structural information, the step of screening a directory database subset matched with the target website from the directory database according to the website topological graph of the target website and a pre-generated directory database comprises the following steps:
acquiring structural information of the target website from the website topological graph, wherein the structural information comprises one or more of a scripting language of the target website, a directory depth of the target website and a directory breadth of the target website;
performing state conversion on the directory database subsets in the directory database by adopting a permutation function according to the structural information of the target website, and screening out the directory database subsets which are not matched with the structural information;
and determining the rest directory database subset in the directory database as the directory database subset matched with the target website.
6. The method of claim 5, wherein the pattern matching comprises directory tree structure matching, and the querying a subset of directory databases matching the target website from the directory databases according to the website topology map of the target website and a pre-generated directory database comprises:
converting a subset of a directory database in the directory database into a directory tree structure;
selecting a directory tree structure with the similarity to the target website greater than or equal to a preset value from the directory tree structures in the directory database by adopting a pattern matching algorithm;
and determining the directory database subset corresponding to the selected directory tree structure as the directory database subset matched with the target website.
7. A directory database generation apparatus, the apparatus comprising:
the first determining module is used for acquiring a website architecture of a target website and determining a crawler strategy according to the website architecture;
the acquisition module is used for acquiring the source code data of the target website according to the crawler strategy;
a second determining module, configured to determine, according to a path where the source code data is obtained, a primary key of the source code data in a directory database, and determine, according to a feature parameter of the source code data, a secondary key of the source code data in the directory database, where the feature parameter includes: the type of the source code data, the internet protocol address, the scripting language, the file type, the directory depth and the directory breadth;
and the generating module is used for determining a storage directory of the source code data of the target website in the directory database according to the primary key and the secondary key, storing the source code data and generating the directory database.
8. The apparatus of claim 7, wherein the obtaining module comprises:
the analysis unit is used for analyzing the source code compressed packet from the webpage data stream of the target website in a multithreading mode;
and the acquisition unit is used for processing the source code compressed packet to acquire the source code data of the target website.
9. A directory database query apparatus applied to a directory database generated by the directory database generation apparatus according to any one of claims 7 to 8, the apparatus comprising:
the acquisition module is used for setting a crawler strategy of a target website and acquiring a website topological graph of the target website according to the crawler strategy, wherein the crawler strategy comprises setting a crawler starting uniform resource locator, target website encrypted data and requested header information;
the query module is used for querying a directory database subset matched with the target website from the directory database by adopting a pattern matching method according to the website topological graph of the target website and a pre-generated directory database, wherein the pattern matching comprises website structural information matching and directory tree structure matching;
when the pattern match comprises a website structured information match, the query module is to:
acquiring structural information of the target website from the website topological graph, wherein the structural information comprises one or more of a scripting language of the target website, a directory depth of the target website and a directory breadth of the target website;
performing state conversion on the directory database subsets in the directory database by adopting a permutation function according to the structural information of the target website, and screening out the directory database subsets which are not matched with the structural information;
and determining the rest directory database subset in the directory database as the directory database subset matched with the target website.
CN201610806972.9A 2016-09-06 2016-09-06 Directory database generation and query method and device Active CN106446068B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610806972.9A CN106446068B (en) 2016-09-06 2016-09-06 Directory database generation and query method and device

Publications (2)

Publication Number Publication Date
CN106446068A CN106446068A (en) 2017-02-22
CN106446068B true CN106446068B (en) 2020-02-07

Family

ID=58165210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610806972.9A Active CN106446068B (en) 2016-09-06 2016-09-06 Directory database generation and query method and device

Country Status (1)

Country Link
CN (1) CN106446068B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant