Method for extracting CMS (content management system) identification feature rules based on online websites
Technical Field
The invention relates to the field of content management, in particular to a CMS (content management system) identification feature rule extraction method based on an online website.
Background
A content management system (CMS) is a system, written in a programming language that runs on the server side, for managing and maintaining the columns, content and templates of a website. As the internet has developed, the number of CMS types has kept growing. Developers no longer need to build a website from scratch: a website can be set up quickly simply by downloading a suitable open source site-building program from the internet, so a large number of websites on the internet are built with a CMS. In network security, identifying which CMS program a website uses has an important influence on security testing, and accurate CMS identification can greatly reduce the workload of the security testing stage.
Common methods for identifying the web fingerprint of a website include: checking whether the home page content contains a certain keyword, checking whether a certain page contains a certain keyword, checking whether the md5 of a certain static file of the website equals an expected value, and the like.
The traditional method for collecting web fingerprint rules is as follows: a website is determined to be built with a certain open source site-building program, and a static file specific to that program is found; for example, a logo picture, js file or css file of the site-building program that contains the brand name is a file specific to the program. The absolute URL path of the file is defined as a feature file path of the site-building program, and the md5 value of the file is defined as a feature value of the site-building program. The feature file path, the feature value and the brand of the site-building program together form a web fingerprint identification rule.
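The rule structure described above (feature file path + feature value + site-building program) can be illustrated with a minimal Python sketch. The class name `FingerprintRule`, the function `matches` and the caller-supplied `fetch` function are hypothetical names introduced only for illustration; they are not part of the described method.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class FingerprintRule:
    cms_name: str      # brand of the site-building program the rule identifies
    feature_path: str  # absolute URL path of the feature file
    feature_md5: str   # expected MD5 hex digest of that file (the feature value)


def md5_hex(data: bytes) -> str:
    """Return the MD5 hex digest of a file body."""
    return hashlib.md5(data).hexdigest()


def matches(rule: FingerprintRule, base_url: str,
            fetch: Callable[[str], Optional[bytes]]) -> bool:
    """Return True if the file at base_url + feature_path has the expected MD5.

    `fetch` is an assumed helper that returns the response body,
    or None when the file cannot be retrieved.
    """
    body = fetch(base_url.rstrip("/") + rule.feature_path)
    return body is not None and md5_hex(body) == rule.feature_md5
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP client; in practice it would wrap a real request.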
As the variety of CMSs on the internet increases, quickly enriching the rule base for web fingerprint identification becomes the key to improving identification efficiency. The traditional method is to find feature file paths manually, which has two drawbacks. Efficiency is very low: feature files must be searched for in web page source code based on experience, most feature files are in a special directory of the website program, and the page being viewed may not load them. Precision is poor: feature files that are easy to find are often not unique to the site-building program, so the false recognition rate is high.
Therefore, as the number of CMSs keeps increasing, the existing rule base extraction method for CMS fingerprint identification cannot meet demand, and efficiently collecting web fingerprint rules so that the web fingerprint information of websites can be identified accurately is a problem that urgently needs to be solved.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: existing methods for extracting a CMS fingerprint identification rule base cannot meet demand, and web fingerprint rules need to be collected efficiently so that the web fingerprint information of websites can be identified accurately. To this end, a CMS identification feature rule extraction method based on an online website is provided.
The invention solves the above technical problem through the following technical solution, which comprises the following steps:
S1: acquiring demonstration site addresses from the internet: collecting one or more online demonstration sites of a given CMS, and acquiring a CMS keyword list;
S2: crawling static file addresses in the page content of each demonstration site: accessing pages and links of the demonstration site, where the accessed pages and links include but are not limited to the site home page, a page formed from a random character string (404), and pages or links formed by splicing /admin, /robots.txt, /README.md and /LICENSE.txt onto the site domain name; and extracting from the page content the in-site page links and the static file link addresses containing keywords;
S3: extracting, from the static files acquired from the multiple demonstration sites, the pictures with the same md5 value and the files containing keywords that the sites have in common: comparing the static links obtained in step S2 for the demonstration sites supplied by step S1, and finding picture files with the same md5 value and js and css files that have the same path and contain keyword information, to form a result set R1;
S4: mutually checking among the demonstration sites whether the static file addresses of the other demonstration sites exist: checking in turn whether the CMS feature set of each demonstration site exists under the other demonstration sites, to form a result set R2; returning the result sets R1 and R2; and then manually finding representative rule files and paths in the result sets.
Preferably, the process of acquiring the home page information in step S2 is as follows: accessing the home page of the input demonstration site, acquiring the page content, extracting the next-level page link addresses under the home page and all js, css and image link addresses, judging whether the js and css files contain keyword information, and, for pictures, judging the picture type and obtaining the md5 value;
the 404 page in step S2 includes: splicing a path formed from a string of random characters onto the input demonstration site, domain name or IP to form a page address, accessing it and obtaining the returned page content, the returned network status code being 404;
the "/admin" page in step S2 includes: splicing "/admin" onto the input demonstration site to form a new page address, accessing it, and judging whether it can be opened normally with a returned network status code of 200;
the "/robots.txt" file link in step S2 includes: splicing "/robots.txt" onto the input demonstration site, domain name or IP to form a static file link address, accessing it, judging whether the file can be opened normally with a returned network status code of 200, and judging whether the file content contains keyword information;
the "/README.md" file link in step S2 includes: splicing "/README.md" onto the input demonstration site, domain name or IP to form a static file link address, accessing it, judging whether the file can be opened normally with a returned network status code of 200, and judging whether the file content contains keyword information;
the "/LICENSE.txt" file link in step S2 includes: splicing "/LICENSE.txt" onto the input demonstration site, domain name or IP to form a static file link address, accessing it, judging whether the file can be opened normally with a returned network status code of 200, and judging whether the file content contains keyword information;
the js links (or links with the suffix .js), the css links (or links with the suffix .css) and the picture links (with suffixes including but not limited to .jpg, .png, .jpeg, .ico and .gif) found in the above pages are aggregated, and the resulting set of static files is stored in a first database, where the stored data include the relative address of each static file; for image files the MD5 value is calculated; for js and css files the text content is searched for the keywords input in step S1, and if a keyword is present, the content of the line containing the keyword is stored.
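The per-file record stored at the end of step S2 might be built as in the following Python sketch. The function name `build_record`, the record layout and the case-insensitive keyword match are assumptions made for illustration; the text only specifies what information is stored, not how.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Optional, Sequence
from urllib.parse import urlparse

IMAGE_SUFFIXES = {".jpg", ".png", ".jpeg", ".ico", ".gif"}
TEXT_SUFFIXES = {".js", ".css"}


def build_record(url: str, body: bytes, keywords: Sequence[str]) -> Optional[dict]:
    """Build the stored record for one static file, or None for other file types."""
    path = urlparse(url).path                    # relative address, query string dropped
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        # Image files: store the MD5 value of the file body.
        return {"path": path, "kind": "image", "md5": hashlib.md5(body).hexdigest()}
    if suffix in TEXT_SUFFIXES:
        # js/css files: store every line that contains one of the S1 keywords.
        text = body.decode("utf-8", errors="replace")
        hits = [
            {"keyword": kw, "line": line.strip()}
            for line in text.splitlines()
            for kw in keywords
            if kw.lower() in line.lower()        # assumption: case-insensitive match
        ]
        return {"path": path, "kind": "text", "hits": hits}
    return None
```

A text file with no keyword hit is still recorded here with an empty `hits` list, so that its relative address remains available to later steps; whether such files should be kept is a design choice the text leaves open.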
Preferably, the specific steps of the comparison in step S3 are as follows: each demonstration site is polled in turn, and the static files in the result set it formed in step S2 are compared with the static files in the result sets obtained by the other demonstration sites in step S2; for static files of picture type, it is judged whether the MD5 values are the same; for js, css and txt static text files, it is judged whether they contain keyword information; finally, the pictures with the same MD5 value and the text files that contain keyword information under the same file path are output.
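The comparison that yields result set R1 can be sketched in Python as follows. The function name `build_r1` and the input shapes are hypothetical, and the sketch assumes that a file counts as shared only when it is found on every demonstration site; the text does not state whether all sites or only some must agree.

```python
from collections import defaultdict


def build_r1(images: dict, keyword_texts: dict) -> dict:
    """Form R1 from the per-site S2 results.

    images:        site -> {relative path: MD5} for picture files
    keyword_texts: site -> set of relative paths of text files containing a keyword
    """
    sites = set(images) | set(keyword_texts)
    # Group pictures by MD5 and remember on which sites each MD5 was seen.
    by_md5 = defaultdict(lambda: {"sites": set(), "paths": set()})
    for site, files in images.items():
        for path, md5 in files.items():
            by_md5[md5]["sites"].add(site)
            by_md5[md5]["paths"].add(path)
    shared_images = {
        md5: sorted(info["paths"])
        for md5, info in by_md5.items()
        if info["sites"] == sites                # assumption: present on every site
    }
    # Text files must contain a keyword and sit at the same path on every site.
    text_sets = [set(keyword_texts.get(site, ())) for site in sites]
    shared_texts = set.intersection(*text_sets) if text_sets else set()
    return {"images": shared_images, "texts": sorted(shared_texts)}
```

Pictures are matched by MD5 alone, so the same logo stored under different paths on two sites is still reported, with all of its paths listed.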
Preferably, the specific process of step S4 is as follows:
SS1: aggregating the CMS feature set file paths under all demonstration sites, and de-duplicating them to form a set D1;
SS2: traversing each site in turn, removing from the set D1 formed in SS1 all feature set file paths found under the currently traversed site, and then splicing each remaining file path in the set onto the URL of the site and requesting access;
SS3: checking the response after each spliced URL is accessed, and recording the URLs that can be accessed normally with a returned status code of 200, such URLs comprising picture addresses and js, css and txt static files;
for a picture address, the MD5 value of the picture is calculated and compared with the MD5 value from the other sites, and if the MD5 values are the same, the picture is recorded, the recorded data including the relative address of the picture file and the MD5 value;
for js, css and txt static files, it is judged whether keyword information exists in the file content, and if so, the file is recorded, the recorded data including the relative address of the text file, the keyword and the content of the line where the keyword is located;
the results recorded in step SS3 are merged to form the result set R2, the result sets R1 and R2 are returned, and representative rule files and paths are then found in them.
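Steps SS1 and SS2 amount to a set union followed by a per-site set difference. The following Python sketch illustrates this under the assumption that each site's feature paths are already available as a set; the function name `candidate_urls` is hypothetical.

```python
def candidate_urls(feature_paths: dict) -> dict:
    """Plan the SS2 requests.

    feature_paths: site base URL -> set of feature file paths found on that site.
    Returns:       site base URL -> sorted list of URLs to request on that site.
    """
    # SS1: aggregate the paths of all sites; a set de-duplicates them (set D1).
    d1 = set().union(*feature_paths.values())
    plan = {}
    for site, own_paths in feature_paths.items():
        # SS2: drop the paths already found on this site, splice the rest onto its URL.
        remaining = d1 - set(own_paths)
        plan[site] = sorted(site.rstrip("/") + path for path in remaining)
    return plan
```

Only paths that a site did not expose during crawling are requested, so each request tests whether a feature file discovered elsewhere also exists on that site.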
Preferably, the keyword information is the CMS name or another keyword associated with the CMS.
Compared with the prior art, the invention has the following advantages: with this CMS identification feature rule extraction method based on an online website, representative rule files are extracted, without access to the source code, by crawling the page links of one or more demonstration sites, so that fingerprint identification rules are collected automatically, collection efficiency is improved, and the accuracy of rule identification is improved.
Drawings
FIG. 1 is an overall flow diagram of the present invention;
FIG. 2 is a flowchart of step S2 of the present invention;
FIG. 3 is a flowchart of step S3 of the present invention;
FIG. 4 is a flowchart of step S4 of the present invention.
Detailed Description
The following examples are given for the detailed implementation and specific operation of the present invention, but the scope of the present invention is not limited to the following examples.
As shown in FIG. 1 to FIG. 4, the present embodiment provides a technical solution: a CMS identification feature rule extraction method based on an online website, comprising the following steps:
S1: acquiring demonstration site addresses from the internet: collecting one or more online demonstration sites of a given CMS, and acquiring a CMS keyword list;
S2: crawling static file addresses in the page content of each demonstration site: accessing pages and links of the demonstration site, where the accessed pages and links include but are not limited to the site home page, a page formed from a random character string (404), and pages or links formed by splicing /admin, /robots.txt, /README.md and /LICENSE.txt onto the site domain name; and extracting from the page content the in-site page links and the static file link addresses containing keywords;
S3: extracting, from the static files acquired from the multiple demonstration sites, the pictures with the same md5 value and the files containing keywords that the sites have in common: comparing the static links obtained in step S2 for the demonstration sites supplied by step S1, and finding picture files with the same md5 value and js and css files that have the same path and contain keyword information, to form a result set R1;
S4: mutually checking among the demonstration sites whether the static file addresses of the other demonstration sites exist: checking in turn whether the CMS feature set of each demonstration site exists under the other demonstration sites, to form a result set R2; returning the result sets R1 and R2; and then manually finding representative rule files and paths in the result sets.
The process of acquiring the home page information in step S2 is as follows: accessing the home page of the input demonstration site, acquiring the page content, extracting the next-level page link addresses under the home page and all js, css and image link addresses, judging whether the js and css files contain keyword information, and, for pictures, judging the picture type and obtaining the md5 value;
the 404 page in step S2 includes: splicing a path formed from a string of random characters onto the input demonstration site, domain name or IP to form a page address, accessing it and obtaining the returned page content, the returned network status code being 404;
the "/admin" page in step S2 includes: splicing "/admin" onto the input demonstration site to form a new page address, accessing it, and judging whether it can be opened normally with a returned network status code of 200;
the "/robots.txt" file link in step S2 includes: splicing "/robots.txt" onto the input demonstration site, domain name or IP to form a static file link address, accessing it, judging whether the file can be opened normally with a returned network status code of 200, and judging whether the file content contains keyword information;
the "/README.md" file link in step S2 includes: splicing "/README.md" onto the input demonstration site, domain name or IP to form a static file link address, accessing it, judging whether the file can be opened normally with a returned network status code of 200, and judging whether the file content contains keyword information;
the "/LICENSE.txt" file link in step S2 includes: splicing "/LICENSE.txt" onto the input demonstration site, domain name or IP to form a static file link address, accessing it, judging whether the file can be opened normally with a returned network status code of 200, and judging whether the file content contains keyword information;
the js links (or links with the suffix .js), the css links (or links with the suffix .css) and the picture links (with suffixes including but not limited to .jpg, .png, .jpeg, .ico and .gif) found in the above pages are aggregated, and the resulting set of static files is stored in a first database, where the stored data include the relative address of each static file; for image files the MD5 value is calculated; for js and css files the text content is searched for the keywords input in step S1, and if a keyword is present, the content of the line containing the keyword is stored.
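How the probe addresses of step S2 are formed, and how links are pulled out of a fetched page, can be sketched with the Python standard library. The names `probe_urls` and `LinkCollector`, the 16-character random path and the restriction of collected links to the page's own host are assumptions made for illustration.

```python
import secrets
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

FIXED_PROBES = ["/admin", "/robots.txt", "/README.md", "/LICENSE.txt"]


def probe_urls(site: str) -> list:
    """Return the home page, a random (expected 404) page and the fixed probe URLs."""
    root = site.rstrip("/")
    random_page = "/" + secrets.token_hex(8)     # random string spliced onto the site
    return [root + "/", root + random_page] + [root + p for p in FIXED_PROBES]


class LinkCollector(HTMLParser):
    """Collect in-site page links and js/css/image link addresses from one page."""

    def __init__(self, page_url: str):
        super().__init__()
        self.page_url = page_url
        self.pages, self.static = set(), set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            target, bucket = attrs["href"], self.pages
        elif tag == "link" and attrs.get("href"):
            target, bucket = attrs["href"], self.static
        elif tag in ("script", "img") and attrs.get("src"):
            target, bucket = attrs["src"], self.static
        else:
            return
        url = urljoin(self.page_url, target)     # resolve relative links
        # Assumption: only links on the same host as the page are kept.
        if urlparse(url).netloc == urlparse(self.page_url).netloc:
            bucket.add(url)
```

The collector is fed the page content with `feed(html)`; the URLs in `static` can then be filtered by suffix and fetched to compute MD5 values or search for keywords.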
The specific steps of the comparison in step S3 are as follows: each demonstration site is polled in turn, and the static files in the result set it formed in step S2 are compared with the static files in the result sets obtained by the other demonstration sites in step S2; for static files of picture type, it is judged whether the MD5 values are the same; for js, css and txt static text files, it is judged whether they contain keyword information; finally, the pictures with the same MD5 value and the text files that contain keyword information under the same file path are output.
The specific process of step S4 is as follows:
SS1: aggregating the CMS feature set file paths under all demonstration sites, and de-duplicating them to form a set D1;
SS2: traversing each site in turn, removing from the set D1 formed in SS1 all feature set file paths found under the currently traversed site, and then splicing each remaining file path in the set onto the URL of the site and requesting access;
SS3: checking the response after each spliced URL is accessed, and recording the URLs that can be accessed normally with a returned status code of 200, such URLs comprising picture addresses and js, css and txt static files;
for a picture address, the MD5 value of the picture is calculated and compared with the MD5 value from the other sites, and if the MD5 values are the same, the picture is recorded, the recorded data including the relative address of the picture file and the MD5 value;
for js, css and txt static files, it is judged whether keyword information exists in the file content, and if so, the file is recorded, the recorded data including the relative address of the text file, the keyword and the content of the line where the keyword is located;
the results recorded in step SS3 are merged to form the result set R2, the result sets R1 and R2 are returned, and representative rule files and paths are then found in them.
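The check performed in SS3 on each spliced URL, and the merging of the recorded results into R2, might look like the following Python sketch. The names `check_candidate` and `build_r2`, the injected `fetch(url) -> (status_code, body)` helper and the `known_md5` mapping (path to the MD5 value observed on the other sites) are assumptions; for brevity only the first keyword hit per text file is recorded.

```python
import hashlib

IMAGE_SUFFIXES = (".jpg", ".png", ".jpeg", ".ico", ".gif")
TEXT_SUFFIXES = (".js", ".css", ".txt")


def check_candidate(site, path, known_md5, keywords, fetch):
    """Apply SS3 to one spliced URL; return a record or None."""
    status, body = fetch(site.rstrip("/") + path)
    if status != 200:                            # only normally accessible URLs count
        return None
    lower = path.lower()
    if lower.endswith(IMAGE_SUFFIXES):
        md5 = hashlib.md5(body).hexdigest()
        # Record the picture only if its MD5 equals the value seen on the other sites.
        return {"path": path, "md5": md5} if md5 == known_md5.get(path) else None
    if lower.endswith(TEXT_SUFFIXES):
        text = body.decode("utf-8", errors="replace")
        for line in text.splitlines():
            for kw in keywords:
                if kw.lower() in line.lower():   # assumption: case-insensitive match
                    return {"path": path, "keyword": kw, "line": line.strip()}
    return None


def build_r2(plan, known_md5, keywords, fetch):
    """plan: site -> candidate paths (set D1 minus the site's own paths)."""
    r2 = []
    for site, paths in plan.items():
        for path in paths:
            record = check_candidate(site, path, known_md5, keywords, fetch)
            if record:
                r2.append({"site": site, **record})
    return r2
```

Because `fetch` is injected, the logic can be exercised against an in-memory dictionary before being pointed at live demonstration sites.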
The keyword information is the CMS name or another keyword associated with the CMS.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, various embodiments or examples, and features of different embodiments or examples described in this specification, can be combined by one skilled in the art where they do not contradict one another.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.