Method for extracting CMS (content management system) identification feature rules based on online websites
Technical Field
The invention relates to the field of content management, in particular to a CMS (content management system) identification feature rule extraction method based on an online website.
Background
A content management system (CMS) is a system, written in a programming language that runs on the server side, for managing and maintaining the columns, contents and templates of a website. With the continuous development of the Internet, the variety of CMSs keeps growing. Developers no longer need to build a website from scratch; a website can be set up quickly simply by downloading a suitable open-source site-building program from the Internet, so a large number of websites built with CMSs exist on the Internet. In network security, identifying which CMS a website uses has an important influence on security testing, and accurately identifying the CMS can greatly reduce the workload of the security testing stage. General methods for identifying the web fingerprint of a website include: checking whether the home page content contains a certain keyword, checking whether a certain page contains a certain keyword, checking whether the md5 of a certain static file of the website equals an expected value, and the like. The traditional method for collecting web fingerprint rules is as follows: first determine that a certain website is built with a certain open-source site-building program; then find a static file that is specific to that program, such as a logo picture, js file or css file of the site-building program that contains the brand name; then define the absolute URL path of that file as the feature file path of the site-building program and the md5 value of that file as the feature value of the site-building program. The feature file path, the feature value and the brand of the site-building program together form one web fingerprint identification rule.
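For illustration only, a rule of the form "feature file path + feature value + brand" might be represented and checked as in the following minimal Python sketch. The class name `FingerprintRule`, the function `matches`, the brand "ExampleCMS" and the path "/static/img/logo.png" are hypothetical and do not come from the original text.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FingerprintRule:
    """One web fingerprint rule (hypothetical structure for illustration)."""
    cms_name: str  # brand of the site-building program
    path: str      # absolute URL path of the feature file
    md5: str       # expected MD5 hex digest of the feature file


def matches(rule: FingerprintRule, file_bytes: bytes) -> bool:
    """Return True if the bytes fetched from rule.path have the expected MD5."""
    return hashlib.md5(file_bytes).hexdigest() == rule.md5


# Stand-in for the bytes of a downloaded logo picture (assumed data).
logo = b"example-logo-bytes"
rule = FingerprintRule("ExampleCMS", "/static/img/logo.png",
                       hashlib.md5(logo).hexdigest())
```

A site would be identified as "ExampleCMS" under this rule only when the file served at the rule path hashes to the stored value.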
As the variety of CMSs on the Internet increases, quickly enriching the rule base used for web fingerprint identification becomes the key to improving the efficiency of web fingerprint identification. The traditional method is to find feature file paths manually, which has two drawbacks. First, the efficiency is very low: feature files must be searched for in webpage source code based on experience, most feature files sit in directories specific to the website program, and the page being viewed may not load them. Second, the precision is poor: the feature files that are easy to find are often not unique to the site-building program, so the false recognition rate is high.
Therefore, as the number of CMSs keeps increasing, existing methods for extracting the rule base for CMS fingerprint identification cannot meet demand, and how to efficiently collect web fingerprint rules that accurately identify the web fingerprint information of a website is a problem that urgently needs to be solved.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: existing methods for extracting the CMS fingerprint identification rule base cannot meet demand, and web fingerprint rules that accurately identify the web fingerprint information of a website need to be collected efficiently. To this end, a CMS identification feature rule extraction method based on an online website is provided.
The invention solves the above technical problem through the following technical solution, which comprises the following steps:
S1: acquiring demonstration site addresses from the Internet, collecting one or more online demonstration sites of a given CMS, and acquiring a CMS keyword list;
S2: crawling static file addresses in the page content of the demonstration sites: accessing pages and links of each demonstration site, the accessed pages and links including but not limited to the site home page, a 404 page formed from a random character string, and pages or links formed by concatenating the site domain name with /admin, /robots.txt, /README.md, /LICENSE.txt and the like, and extracting the in-site page links and the static file link addresses containing keywords from the page content;
S3: extracting, from the static files acquired from the multiple demonstration sites, the pictures that exist in common and have the same md5 value and the files that contain keywords: comparing the static links acquired from the multiple demonstration sites in step S2, finding the pictures that have the same path and the same md5 value and the js and css files that have the same path and contain keyword information, and forming a result set R1;
S4: mutually checking among the multiple demonstration sites whether the static file addresses of the other demonstration sites exist: sequentially checking whether the CMS feature set of each demonstration site exists under the other demonstration sites to form a result set R2, returning the result sets R1 and R2, and then manually finding representative rule files and paths in the result sets.
Preferably, the process of acquiring the home page information in step S2 is as follows: accessing the input home page of the demonstration site, acquiring the page content, extracting the next-level page link addresses and all js, css and image link addresses under the home page, judging whether the js and css files contain keyword information, and judging the picture type and acquiring its md5 value;
the 404 page in step S2 includes: concatenating the input demonstration site, domain name or IP with a string of random characters to form a page address, and accessing it to obtain the content of the returned page, the returned network status code being 404;
the "/admin" page in step S2 includes: concatenating the input demonstration site with "/admin" to form a new page address and accessing it, and judging whether it can be opened normally with a returned network status code of 200;
the "/robots.txt" file link in step S2 includes: concatenating the input demonstration site, domain name or IP with "/robots.txt" to form a static file link address and accessing it, judging whether the file can be opened normally with a returned network status code of 200, and judging whether the file content contains keyword information;
the "/README.md" file link in step S2 includes: concatenating the input demonstration site, domain name or IP with "/README.md" to form a static file link address and accessing it, judging whether the file can be opened normally with a returned network status code of 200, and judging whether the file content contains keyword information;
the "/LICENSE.txt" file link in step S2 includes: concatenating the input demonstration site, domain name or IP with "/LICENSE.txt" to form a static file link address and accessing it, judging whether the file can be opened normally with a returned network status code of 200, and judging whether the file content contains keyword information;
summarizing the links found in the above pages, namely js links or links with the suffix .js, css links or links with the suffix .css, and picture links whose suffixes include but are not limited to jpg, png, jpeg, ico and gif, and storing the result set formed by these static files in a first database, wherein the stored data content includes the relative address of each static file; for image format files, the MD5 value is calculated and stored; for files such as js and css, the text content is searched for the keywords input in step S1 and, if a keyword is found, the content of the line where the keyword occurs is stored.
Preferably, the specific steps of the comparison in step S3 are as follows: sequentially polling each demonstration site and comparing the static files in the result set formed for it in step S2 with the static files in the result sets obtained for the other demonstration sites in step S2, wherein for picture-type static files it is judged whether the MD5 values are the same, and for static text files such as js, css and txt it is judged whether keyword information exists; finally, the pictures with the same MD5 value and the text files containing keyword information that exist under the same file path are output.
Preferably, the specific process of step S4 is as follows:
SS1: aggregating the CMS feature set file paths under all demonstration sites and de-duplicating them to form a set D1;
SS2: sequentially traversing each site, removing from the set D1 formed in SS1 all feature set file paths under the currently traversed site, and then concatenating the remaining file paths of the set with the URL of that site and requesting access;
SS3: judging the value returned after each concatenated URL is accessed, and recording each URL that can be accessed normally and returns a status code of 200, the URLs including picture addresses and static files such as js, css and txt;
for a picture address, the MD5 value of the picture is calculated and compared with the MD5 value at the other sites, and if the MD5 values are the same, the recorded data content includes the relative address of the picture file and the MD5 value;
for static files such as js, css and txt, it is judged whether keyword information exists in the file content, and if so, it is recorded, the recorded data content including the relative address of the text file, the keyword and the content of the line where the keyword occurs;
the results recorded in step SS3 are merged to form a result set R2, the result sets R1 and R2 are returned, and representative rule files and paths are then found therein.
Preferably, the keyword information is the CMS name or another keyword associated with the CMS.
Compared with the prior art, the invention has the following advantages: without access to the source code, the CMS identification feature rule extraction method based on an online website extracts representative rule files by crawling the page links of one or more demonstration sites, thereby achieving automatic collection of fingerprint identification rules, improving collection efficiency and improving the accuracy of rule identification.
Drawings
FIG. 1 is an overall flow diagram of the present invention;
FIG. 2 is a flowchart of step S2 of the present invention;
FIG. 3 is a flowchart of step S3 of the present invention;
FIG. 4 is a flowchart of step S4 of the present invention.
Detailed Description
The following examples are given for the detailed implementation and specific operation of the present invention, but the scope of the present invention is not limited to the following examples.
As shown in FIGS. 1 to 4, the present embodiment provides a technical solution: a CMS identification feature rule extraction method based on an online website, comprising the following steps:
S1: acquiring demonstration site addresses from the Internet, collecting one or more online demonstration sites of a given CMS, and acquiring a CMS keyword list;
S2: crawling static file addresses in the page content of the demonstration sites: accessing pages and links of each demonstration site, the accessed pages and links including but not limited to the site home page, a 404 page formed from a random character string, and pages or links formed by concatenating the site domain name with /admin, /robots.txt, /README.md, /LICENSE.txt and the like, and extracting the in-site page links and the static file link addresses containing keywords from the page content;
S3: extracting, from the static files acquired from the multiple demonstration sites, the pictures that exist in common and have the same md5 value and the files that contain keywords: comparing the static links acquired from the multiple demonstration sites in step S2, finding the pictures that have the same path and the same md5 value and the js and css files that have the same path and contain keyword information, and forming a result set R1;
S4: mutually checking among the multiple demonstration sites whether the static file addresses of the other demonstration sites exist: sequentially checking whether the CMS feature set of each demonstration site exists under the other demonstration sites to form a result set R2, returning the result sets R1 and R2, and then manually finding representative rule files and paths in the result sets.
The process of acquiring the home page information in step S2 is as follows: accessing the input home page of the demonstration site, acquiring the page content, extracting the next-level page link addresses and all js, css and image link addresses under the home page, judging whether the js and css files contain keyword information, and judging the picture type and acquiring its md5 value;
the 404 page in step S2 includes: concatenating the input demonstration site, domain name or IP with a string of random characters to form a page address, and accessing it to obtain the content of the returned page, the returned network status code being 404;
the "/admin" page in step S2 includes: concatenating the input demonstration site with "/admin" to form a new page address and accessing it, and judging whether it can be opened normally with a returned network status code of 200;
the "/robots.txt" file link in step S2 includes: concatenating the input demonstration site, domain name or IP with "/robots.txt" to form a static file link address and accessing it, judging whether the file can be opened normally with a returned network status code of 200, and judging whether the file content contains keyword information;
the "/README.md" file link in step S2 includes: concatenating the input demonstration site, domain name or IP with "/README.md" to form a static file link address and accessing it, judging whether the file can be opened normally with a returned network status code of 200, and judging whether the file content contains keyword information;
the "/LICENSE.txt" file link in step S2 includes: concatenating the input demonstration site, domain name or IP with "/LICENSE.txt" to form a static file link address and accessing it, judging whether the file can be opened normally with a returned network status code of 200, and judging whether the file content contains keyword information;
summarizing the links found in the above pages, namely js links or links with the suffix .js, css links or links with the suffix .css, and picture links whose suffixes include but are not limited to jpg, png, jpeg, ico and gif, and storing the result set formed by these static files in a first database, wherein the stored data content includes the relative address of each static file; for image format files, the MD5 value is calculated and stored; for files such as js and css, the text content is searched for the keywords input in step S1 and, if a keyword is found, the content of the line where the keyword occurs is stored.
The specific steps of the comparison in step S3 are as follows: sequentially polling each demonstration site and comparing the static files in the result set formed for it in step S2 with the static files in the result sets obtained for the other demonstration sites in step S2, wherein for picture-type static files it is judged whether the MD5 values are the same, and for static text files such as js, css and txt it is judged whether keyword information exists; finally, the pictures with the same MD5 value and the text files containing keyword information that exist under the same file path are output.
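The comparison of step S3 can be sketched as follows. The function name `build_r1` and the record layout (a dictionary with either an "md5" entry for pictures or a "keyword" entry for text files) are assumptions made for illustration. The sketch also assumes that a file qualifies only if its path appears on every demonstration site; a looser reading, in which two sites sharing the file is enough, is equally compatible with the description.

```python
from typing import Dict, List


def build_r1(site_records: Dict[str, Dict[str, dict]]) -> List[dict]:
    """site_records maps a site URL to {relative path: record from step S2}."""
    sites = list(site_records)
    if len(sites) < 2:
        return []  # nothing to compare against
    # Paths present in the S2 result set of every site.
    common_paths = set.intersection(*(set(site_records[s]) for s in sites))
    r1 = []
    for path in sorted(common_paths):
        records = [site_records[s][path] for s in sites]
        if all("md5" in r for r in records):
            # Picture: keep it only if all sites report the same MD5 value.
            if len({r["md5"] for r in records}) == 1:
                r1.append(records[0])
        elif all("keyword" in r for r in records):
            # Text file: keep it if keyword information was found on every site.
            r1.append(records[0])
    return r1
```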
The specific process of step S4 is as follows:
SS1: aggregating the CMS feature set file paths under all demonstration sites and de-duplicating them to form a set D1;
SS2: sequentially traversing each site, removing from the set D1 formed in SS1 all feature set file paths under the currently traversed site, and then concatenating the remaining file paths of the set with the URL of that site and requesting access;
SS3: judging the value returned after each concatenated URL is accessed, and recording each URL that can be accessed normally and returns a status code of 200, the URLs including picture addresses and static files such as js, css and txt;
for a picture address, the MD5 value of the picture is calculated and compared with the MD5 value at the other sites, and if the MD5 values are the same, the recorded data content includes the relative address of the picture file and the MD5 value;
for static files such as js, css and txt, it is judged whether keyword information exists in the file content, and if so, it is recorded, the recorded data content including the relative address of the text file, the keyword and the content of the line where the keyword occurs;
the results recorded in step SS3 are merged to form a result set R2, the result sets R1 and R2 are returned, and representative rule files and paths are then found therein.
The keyword information is the CMS name or another keyword associated with the CMS.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.