CN111695075B - Website CMS (content management system) identification method and security vulnerability detection method and device - Google Patents

Info

Publication number
CN111695075B
Authority
CN
China
Prior art keywords
cms
website
information
identified
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010534459.5A
Other languages
Chinese (zh)
Other versions
CN111695075A (en)
Inventor
沈潇军
倪阳旦
沈志豪
蔡晴
娄佳
由奇林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Zhejiang Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
State Grid Zhejiang Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Zhejiang Electric Power Co Ltd, Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd filed Critical State Grid Zhejiang Electric Power Co Ltd
Priority to CN202010534459.5A priority Critical patent/CN111695075B/en
Publication of CN111695075A publication Critical patent/CN111695075A/en
Application granted granted Critical
Publication of CN111695075B publication Critical patent/CN111695075B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security

Abstract

The application provides a website CMS identification method, a security vulnerability detection method, and corresponding devices. The CMS classification model is trained on multiple kinds of features, so it is trained more thoroughly; this helps ensure the accuracy of its classification and, in turn, further improves the accuracy of CMS identification.

Description

Website CMS (content management system) identification method, security vulnerability detection method and device
Technical Field
The present application relates to the field of information security technologies, and in particular, to a website CMS identification method, a security vulnerability detection method, and a device.
Background
With the development of the internet, more and more website builders choose to use a CMS (content management system) to build websites, so a large number of websites on the internet are built with a CMS. Identifying the CMS used by a website facilitates security testing of that website.
Currently, CMSs are generally identified using a manually constructed feature fingerprint database. However, the accuracy of identification based on a manually constructed feature fingerprint database can hardly meet precision requirements.
Disclosure of Invention
To solve the above technical problem, embodiments of the present application provide a website CMS identification method, a security vulnerability detection method, and corresponding devices, with the aim of improving the accuracy of CMS identification. The technical solution is as follows:
a method for identifying CMS of a website, the method comprising:
extracting static information of a first set type from a source code program of a CMS to be identified, and taking the static information as a white box feature;
extracting dynamic information of a second set type from the access information of the website to which the CMS to be identified belongs, and taking the dynamic information as black box characteristics;
inputting the white-box features and the black-box features into a pre-trained CMS classification model to obtain a classification result list output by the CMS classification model, and taking the classification result list as a first recognition result list, wherein the classification result list comprises a plurality of classification results, and the CMS classification model is trained in advance using white-box features and black-box features extracted for different types of website CMSs;
identifying the CMS to be identified based on a fingerprint database rule matching algorithm to obtain a second identification result;
identifying the CMS to be identified based on the first identification result list and the second identification result.
Preferably, the identifying the CMS to be identified based on a fingerprint library rule matching algorithm to obtain a second identification result includes:
acquiring information representing the identity of the website to be identified;
judging whether CMS type information matched with the information representing the identity of the website to be identified exists in pre-constructed fingerprint library information or not;
and if so, taking the CMS type information matched with the information representing the identity of the website to be identified as a second identification result.
Preferably, the identifying the CMS to be identified based on the first identification result list and the second identification result includes:
determining decision weights of all classification results in the first recognition result list and decision weights of the second recognition results;
under the condition that each classification result in the first recognition result list is different from the second recognition result, taking a result corresponding to the maximum value in the decision weight as a recognition result of the CMS to be recognized;
adding the decision weight of the classification result in the first recognition result list, which is the same as the second recognition result, to the decision weight of the second recognition result when the classification result in the first recognition result list is the same as the second recognition result, and taking the added decision weight as a target decision weight;
and taking the result corresponding to the maximum value in the target decision weight and the decision weight of the classification result different from the second identification result in the first identification result list as the identification result of the CMS to be identified.
Preferably, the static information of the first setting type includes:
a target path tree and a static resource list;
the extracting of the dynamic information of the second setting type from the access information of the website to which the CMS to be identified belongs, taking the dynamic information as a black box feature, includes:
extracting HTTP return header content, HTTP content keywords and crawler protocol file content from the access information of the website to which the CMS to be identified belongs;
extracting path features from access information of the website to which the CMS to be identified belongs based on the target path tree, wherein the path features are the number of first URLs, the first URLs are generated based on the directory path tree, and the website to be identified can be successfully accessed based on the first URLs;
extracting static resource loading characteristics from the access information of the website to which the CMS to be identified belongs based on the static resource list, wherein the static resource loading characteristics are the number of the same static resources in a static resource log in the access information of the website to which the CMS to be identified belongs as in the static resource list;
and taking the HTTP return header content, the HTTP content keywords, the crawler protocol file content, the path characteristics and the static resource loading characteristics as black box characteristics.
A security vulnerability detection method of a website CMS comprises the following steps:
extracting static information of a first set type from a source code program of a CMS to be identified, and taking the static information as a white box feature;
extracting dynamic information of a second set type from the access information of the website to which the CMS to be identified belongs on the basis of the white-box feature, and taking the dynamic information as a black-box feature;
inputting the white-box features and the black-box features into a pre-trained CMS classification model to obtain a classification result list output by the CMS classification model, and taking the classification result list as a first recognition result list, wherein the classification result list comprises a plurality of classification results, and the CMS classification model is trained in advance using white-box features and black-box features extracted for different types of website CMSs;
identifying the CMS to be identified based on a fingerprint database rule matching algorithm to obtain a second identification result;
identifying the CMS to be identified based on the first identification result list and the second identification result;
acquiring website asset information based on the result of identifying the CMS to be identified;
and performing security vulnerability detection on the CMS to be identified based on the website asset information and a preset vulnerability database.
A website CMS identifying apparatus, comprising:
the first extraction module is used for extracting static information of a first set type from a source code program of the CMS to be identified, and taking the static information as white box characteristics;
the second extraction module is used for extracting dynamic information of a second set type from the access information of the website to which the CMS to be identified belongs, and taking the dynamic information as black box characteristics;
the first recognition module is used for inputting the white-box features and the black-box features into a pre-trained CMS classification model to obtain a classification result list output by the CMS classification model, and taking the classification result list as a first recognition result list, wherein the classification result list comprises a plurality of classification results, and the CMS classification model is trained in advance using white-box features and black-box features extracted for different types of website CMSs;
the second identification module is used for identifying the CMS to be identified based on a fingerprint database rule matching algorithm to obtain a second identification result;
a third identifying module, configured to identify the CMS to be identified based on the first identification result list and the second identification result.
Preferably, the second identification module is specifically configured to:
acquiring information representing the identity of the website to be identified;
judging whether CMS type information matched with the information representing the identity of the website to be identified exists in pre-constructed fingerprint library information or not;
and if so, taking the CMS type information matched with the information representing the identity of the website to be identified as a second identification result.
Preferably, the third identifying module is specifically configured to:
determining decision weights of all classification results in the first recognition result list and decision weights of the second recognition results;
under the condition that each classification result in the first recognition result list is different from the second recognition result, taking a result corresponding to the maximum value in the decision weight as a recognition result of the CMS to be recognized;
when the classification result which is the same as the second identification result exists in each classification result in the first identification result list, adding the decision weight of the classification result which is the same as the second identification result in the first identification result list with the decision weight of the second identification result, and taking the added decision weight as a target decision weight;
and taking the result corresponding to the maximum value in the target decision weight and the decision weight of the classification result different from the second identification result in the first identification result list as the identification result of the CMS to be identified.
Preferably, the static information of the first setting type includes:
a target path tree and a static resource list;
the second extraction module is specifically configured to:
extracting HTTP return header content, HTTP content keywords and crawler protocol file content from the access information of the website to which the CMS to be identified belongs;
extracting path features from the access information of the website to which the CMS to be identified belongs based on the target path tree, wherein the path features are the number of first URLs, the first URLs are generated based on the directory path tree, and the website to be identified can be successfully accessed based on the first URLs;
extracting static resource loading characteristics from the access information of the website to which the CMS to be identified belongs based on the static resource list, wherein the static resource loading characteristics are the number of the same static resources in a static resource log in the access information of the website to which the CMS to be identified belongs as in the static resource list;
and taking the HTTP return header content, the HTTP content keywords, the crawler protocol file content, the path characteristics and the static resource loading characteristics as black box characteristics.
A security vulnerability detection apparatus of a website CMS, comprising:
the first extraction module is used for extracting static information of a first set type from a source code program of the CMS to be identified, and taking the static information as white box characteristics;
the second extraction module is used for extracting dynamic information of a second set type from the access information of the website to which the CMS to be identified belongs, and taking the dynamic information as black box characteristics;
the first recognition module is used for inputting the white-box features and the black-box features into a pre-trained CMS classification model to obtain a classification result list output by the CMS classification model, and taking the classification result list as a first recognition result list, wherein the classification result list comprises a plurality of classification results, and the CMS classification model is trained in advance using white-box features and black-box features extracted for different types of website CMSs;
the second identification module is used for identifying the CMS to be identified based on a fingerprint database rule matching algorithm to obtain a second identification result;
a third identifying module, configured to identify the CMS to be identified based on the first identifying result list and the second identifying result;
an obtaining module, configured to obtain website asset information based on a result of identifying the CMS to be identified;
and the detection module is used for carrying out security vulnerability detection on the CMS to be identified based on the website asset information and a preset vulnerability database.
Compared with the prior art, the beneficial effects of this application are as follows:
in the application, the CMS to be identified is identified by combining the CMS classification model and the fingerprint database rule matching algorithm, so that the identification accuracy of the CMS can be improved.
Moreover, the CMS classification model is obtained based on multiple feature (namely white box features and black box features) training, so that the CMS classification model is trained more fully, the accuracy of the CMS classification model in classification is ensured, and the accuracy of the CMS identification is further improved on the basis.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive labor.
Fig. 1 is a flowchart of a method for identifying CMS of a website according to embodiment 1 of the present application;
fig. 2 is a flowchart of a method for identifying CMS of a website according to embodiment 2 of the present application;
fig. 3 is a schematic diagram illustrating a specific scenario of determining a recognition result according to the present application;
fig. 4 is a flowchart of a method for identifying CMS of a website according to embodiment 3 of the present application;
fig. 5 is a flowchart of a security vulnerability detection method of a CMS of a website provided by the present application;
fig. 6 is a schematic diagram of a logical structure of a CMS identification apparatus for a website provided by the present application;
fig. 7 is a schematic diagram of a logical structure of a security vulnerability detection apparatus of a website CMS according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 shows a flowchart of the website CMS identification method provided in embodiment 1 of the present application. As shown in fig. 1, the method may include, but is not limited to, the following steps:
step S11, extracting static information of a first set type from a source code program of the CMS to be identified, and taking the static information as white box characteristics.
The first setting type may be set as needed, and is not limited in this embodiment.
Static information can be understood as information that does not change, such as static directory paths; it can be determined from the source code program of the CMS to be identified.
And S12, extracting dynamic information of a second set type from the access information of the website to which the CMS to be identified belongs, and taking the dynamic information as black box characteristics.
The second setting type may be set as needed, and is not limited in this embodiment.
Dynamic information can be understood as information that may change as the website to be identified is accessed, such as HTTP information.
In this embodiment, black box features corresponding to different CMSs to be identified may be extracted in a distributed, asynchronous manner using Celery, an asynchronous distributed framework for Python. Celery works with a Master process responsible for management and a plurality of Worker processes responsible for executing specific tasks. The Master process monitors whether a new task has been submitted to the task list; if so, it creates a Worker process to execute the task, and each Worker process is independent. In this embodiment, a Worker process is created for each submitted website to which a CMS to be identified belongs, and it independently runs Selenium to start a browser kernel driver that simulates normal user access behavior and collects feature information during the access, thereby greatly improving feature extraction efficiency.
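The one-task-per-website pattern described above can be sketched with the Python standard library. This is only an illustration under stated assumptions: it uses concurrent.futures as a stand-in for Celery, and collect_features, collect_all, and fake_visit are hypothetical names; fake_visit replaces the Selenium-driven browser visit with canned data.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def collect_features(url, visit):
    """Run one independent collection task for a single website.

    `visit` is a hypothetical callable standing in for the Selenium-driven
    browser visit; it returns whatever access information was observed.
    """
    return {"url": url, "access_info": visit(url)}


def collect_all(urls, visit, max_workers=4):
    """Submit one task per website and gather results as they complete."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(collect_features, u, visit): u for u in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failing site must not stop the other tasks.
                results[url] = {"url": url, "error": repr(exc)}
    return results


def fake_visit(url):
    # Stand-in for a real browser visit; returns canned access information.
    return {"headers": {"Server": "demo"}, "status": 200}


if __name__ == "__main__":
    print(collect_all(["http://a.example", "http://b.example"], fake_visit))
```

Because each task only touches its own website, a failure or slow response on one site does not block feature collection for the others, which is the property the Worker-per-website design relies on.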
And S13, inputting the white box features and the black box features into a pre-trained CMS classification model to obtain a classification result list output by the CMS classification model, and taking the classification result list as a first recognition result list.
The classification result list comprises a plurality of classification results, and the CMS classification model is trained in advance using white-box features and black-box features extracted for different types of website CMSs.
For different types of website CMSs, the white-box feature extraction process may include:
extracting static information of the first set type from the source code program of each of the different types of website CMSs, and taking the extracted static information as white box features.
For different types of website CMSs, the black-box feature extraction process may include:
extracting dynamic information of the second set type from the access information of the websites to which the different types of website CMSs belong, and taking the extracted dynamic information as black box features.
In this embodiment, the CMS classification model may be, but is not limited to: MLP multi-layer perceptron binary classification model.
And S14, identifying the CMS to be identified based on a fingerprint database rule matching algorithm to obtain a second identification result.
In this embodiment, the identifying the CMS to be identified based on the fingerprint database rule matching algorithm to obtain a second identification result may include:
and S141, obtaining information representing the identity of the website to be identified.
The information characterizing the identity of the website to be identified may include, but is not limited to: the MD5 value of a specific file of the website to be identified, obtained through Selenium, and keywords in the response body content or header information, or URL keywords.
And S142, judging whether CMS type information matched with the information representing the identity of the website to be identified exists in a pre-constructed fingerprint library.
The pre-constructed fingerprint library at least comprises the mapping relation of the characteristic information representing the website identity and the CMS type information of the website.
If yes, go to step S143.
S143, the CMS type information matched with the information representing the identity of the website to be identified is used as a second identification result.
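Steps S141 to S143 can be sketched as a lookup against a small rule table. This is a minimal sketch, not the patented implementation: the FINGERPRINTS entries, the rule kinds, the site_info layout, and the function name match_fingerprint are all hypothetical, and returning None when nothing matches is an assumption, since the text only describes the matching branch.

```python
import hashlib

# Hypothetical fingerprint library: maps identity features to a CMS type.
FINGERPRINTS = [
    {"kind": "file_md5", "path": "/static/logo.png",
     "value": hashlib.md5(b"demo-logo-bytes").hexdigest(), "cms": "CMS-A"},
    {"kind": "header_keyword", "value": "x-powered-by: cms-b", "cms": "CMS-B"},
    {"kind": "body_keyword", "value": "generated by cms-c", "cms": "CMS-C"},
]


def match_fingerprint(site_info, fingerprints=FINGERPRINTS):
    """Return the CMS type of the first matching rule, or None if none match."""
    file_md5 = site_info.get("file_md5", {})
    headers = site_info.get("headers", "").lower()
    body = site_info.get("body", "").lower()
    for rule in fingerprints:
        if rule["kind"] == "file_md5":
            # Compare the MD5 of a specific file against the stored value.
            if file_md5.get(rule["path"]) == rule["value"]:
                return rule["cms"]
        elif rule["kind"] == "header_keyword":
            if rule["value"] in headers:
                return rule["cms"]
        elif rule["kind"] == "body_keyword":
            if rule["value"] in body:
                return rule["cms"]
    return None


if __name__ == "__main__":
    print(match_fingerprint({"headers": "X-Powered-By: CMS-B"}))
```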
And S15, identifying the CMS to be identified based on the first identification result list and the second identification result.
In this embodiment, identifying the CMS to be identified based on both the first identification result list and the second identification result is more accurate than identifying it based on either one alone.
In the application, the CMS to be identified is identified by combining the CMS classification model and the fingerprint database rule matching algorithm, so that the identification accuracy of the CMS can be improved.
Moreover, the CMS classification model is obtained based on multiple feature (namely white box features and black box features) training, so that the CMS classification model is trained more sufficiently, the accuracy of the CMS classification model in classification is ensured, and the accuracy of CMS identification is further improved finally on the basis.
As another alternative embodiment of the present application, referring to fig. 2, there is provided a flowchart of a website CMS identification method in embodiment 2 of the present application, where this embodiment mainly relates to a refinement of the website CMS identification method described in embodiment 1 above, as shown in fig. 2, the method may include, but is not limited to, the following steps:
and S21, extracting static information of a first set type from a source code program of the CMS to be identified, and taking the static information as a white box feature.
S22, extracting dynamic information of a second set type from the access information of the website to which the CMS to be identified belongs, and taking the dynamic information as black box characteristics;
and S23, inputting the white box features and the black box features into a pre-trained CMS classification model to obtain a classification result list output by the CMS classification model, and taking the classification result list as a first recognition result list.
The classification result list comprises a plurality of classification results, and the CMS classification model is trained in advance using white-box features and black-box features extracted for different types of website CMSs.
And S24, identifying the CMS to be identified based on a fingerprint database rule matching algorithm to obtain a second identification result.
The detailed procedures of steps S21 to S24 can be referred to the related descriptions of steps S11 to S14 in embodiment 1, and are not described herein again.
And S25, determining the decision weight of each classification result in the first recognition result list and the decision weight of the second recognition result.
In this embodiment, the decision weight of each classification result in the first recognition result list may be determined by, for example, the following relationship:
[Formula image BDA0002536546210000101: not recoverable from the extracted text]
wherein W represents the initial weight of the CMS classification model, accuracya represents the accuracy of the CMS classification model, and W represents the decision weight of each classification result in the first recognition result list.
The decision weight of the second recognition result can be set as required.
Step S26, taking a result corresponding to the maximum value in the decision weight as the recognition result of the CMS to be recognized when each classification result in the first recognition result list is different from the second recognition result.
Step S27, in a case that the classification result in the first recognition result list is the same as the classification result in the second recognition result, adding the decision weight of the classification result in the first recognition result list, which is the same as the classification result in the second recognition result, to the decision weight of the second recognition result, and taking the added decision weight as a target decision weight.
Step S28, using a result corresponding to the maximum value of the target decision weight and the decision weight of the classification result different from the second recognition result in the first recognition result list as the recognition result of the CMS to be recognized.
Steps S27 to S28 are now described by way of example. As shown in fig. 3, the MLP multi-layer perceptron binary classification model obtains a plurality of classification results, namely CMS-A, CMS-B, and CMS-C, and the fingerprint library rule matching algorithm obtains a second recognition result, CMS-A. The decision weights of the two CMS-A results are added to obtain a target decision weight, which is then compared with the decision weight of CMS-B and the decision weight of CMS-C. If the target decision weight is the largest, CMS-A is taken as the recognition result of the CMS to be recognized.
Steps S25 to S28 are a specific implementation of step S15 in example 1.
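The weight fusion in steps S26 to S28 can be sketched as follows. This is a sketch under stated assumptions: the function name fuse_results, the dict representation of the first recognition result list, and the numeric weights in the usage line are hypothetical, and falling back to the classifier results alone when the fingerprint library yields no match is an assumption not stated in the text.

```python
def fuse_results(first_results, second_result, second_weight):
    """Combine the classifier's result list with the fingerprint result.

    first_results: dict mapping CMS name -> decision weight
                   (the first recognition result list).
    second_result: CMS name from fingerprint matching, or None if no match.
    second_weight: decision weight assigned to the second recognition result.
    """
    weights = dict(first_results)
    if second_result is not None:
        # If the same CMS appears in both, the two weights are added to form
        # the target decision weight (S27); otherwise the second result
        # competes as a separate candidate (S26).
        weights[second_result] = weights.get(second_result, 0.0) + second_weight
    if not weights:
        return None
    # The candidate with the maximum weight is the recognition result (S28).
    return max(weights, key=weights.get)


if __name__ == "__main__":
    # Mirrors the fig. 3 scenario with made-up weights.
    print(fuse_results({"CMS-A": 0.4, "CMS-B": 0.5, "CMS-C": 0.1}, "CMS-A", 0.3))
```

With these illustrative weights, CMS-B leads in the classifier output alone, but the agreement between the classifier and the fingerprint match lifts CMS-A above it.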
As another alternative embodiment of the present application, referring to fig. 4, there is provided a flowchart of a website CMS identification method according to embodiment 3 of the present application, where this embodiment mainly relates to a refinement of the website CMS identification method described in the foregoing embodiment 1, as shown in fig. 4, the method may include, but is not limited to, the following steps:
step S31, extracting a target path tree and a static resource list from a source code program of the CMS to be identified, and taking the target path tree and the static resource list as white-box features.
The process of extracting the target path tree from the source code program of the CMS to be identified may include: after obtaining the source code program of the CMS to be identified, recursively scanning and traversing the CMS directories in it with a depth-first traversal algorithm, numbering and storing the directories layer by layer according to their depth, and finally generating the target path tree.
During the traversal, a hash operation is performed on the files under the CMS directory and the hash information is stored. The target path tree and the files under it are serialized using Python's folder library, and the resulting serialized files are stored in the system so that the target path tree and the files under it can be queried and called at any time later.
Serializing the target path tree and the files under it can be understood as converting them into binary data streams.
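The traversal, hashing, and serialization described above can be sketched as follows. Several choices here are assumptions rather than details from the text: MD5 is used as the hash, the standard pickle module is used for binary serialization (the text names the library only as "folder"), symbolic links are skipped, and the function names and the node layout are hypothetical.

```python
import hashlib
import os
import pickle


def build_path_tree(root, depth=0):
    """Recursively (depth-first) scan `root`, recording depth and file hashes."""
    node = {
        "name": os.path.basename(root.rstrip(os.sep)) or root,
        "depth": depth,   # layer number of this directory
        "files": {},      # file name -> hex digest of file content
        "dirs": [],       # child directory nodes
    }
    for entry in sorted(os.listdir(root)):
        full = os.path.join(root, entry)
        if os.path.islink(full):
            continue  # assumption: skip symlinks to avoid traversal loops
        if os.path.isdir(full):
            node["dirs"].append(build_path_tree(full, depth + 1))
        else:
            with open(full, "rb") as fh:
                node["files"][entry] = hashlib.md5(fh.read()).hexdigest()
    return node


def save_tree(tree, path):
    """Serialize the tree to a binary file."""
    with open(path, "wb") as fh:
        pickle.dump(tree, fh)


def load_tree(path):
    """Deserialize a tree previously written by save_tree."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Note that pickle should only be used to load files the system itself wrote, since unpickling untrusted data can execute arbitrary code.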
The process of extracting the target path tree and the static resource list from the source code program of the CMS to be identified may include:
acquiring all CMS source code files from the source code program of the CMS to be identified;
scanning and traversing all the acquired CMS source code files to obtain a list containing suffixes of all the files;
filtering the list containing the suffixes of all the files by removing suffixes such as php and html, to obtain a new suffix list;
performing deserialization operation on the serialized file to obtain a file under a target path tree;
and extracting static resources corresponding to each suffix in the new suffix list from the files under the target path tree, and forming the extracted static resources into a static resource list.
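The suffix filtering and static resource extraction above might look roughly like the sketch below. It is a simplified illustration, not the method of the application: it works on a flat list of file paths instead of a deserialized path tree, and the names `EXCLUDED_SUFFIXES`, `collect_suffixes` and `static_resource_list` are hypothetical. The excluded set contains only the two suffixes the text names (php, html); the text's "and the like" is left open.

```python
import os

# Assumption: only the two suffixes explicitly named in the text are excluded.
EXCLUDED_SUFFIXES = {"php", "html"}


def suffix_of(path):
    """Return the lower-cased file suffix without the leading dot."""
    return os.path.splitext(path)[1].lstrip(".").lower()


def collect_suffixes(file_paths):
    """List every distinct suffix, in order of first appearance."""
    suffixes = []
    for path in file_paths:
        ext = suffix_of(path)
        if ext and ext not in suffixes:
            suffixes.append(ext)
    return suffixes


def static_resource_list(file_paths, excluded=EXCLUDED_SUFFIXES):
    """Keep only files whose suffix survives the filtering step."""
    kept = [s for s in collect_suffixes(file_paths) if s not in excluded]
    return [p for p in file_paths if suffix_of(p) in kept]
```

For example, given `index.php`, `static/js/app.js` and `img/logo.png`, the resulting static resource list would contain only the `.js` and `.png` files.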
Step S31 is a specific implementation manner of step S11 in example 1.
Step S32, extracting HTTP return header content, HTTP content keywords and crawler protocol file content from the access information of the website to which the CMS to be identified belongs.
Step S33, based on the target path tree, extracting path features from the access information of the website to which the CMS to be identified belongs, where the path feature is the number of first URLs, and a first URL is a URL that is generated based on the target path tree and through which the website to be identified can be successfully accessed.
Step S34, based on the static resource list, extracting static resource loading features from the access information of the website to which the CMS to be identified belongs, where the static resource loading feature is the number of static resources in the static resource log of the access information of the website to which the CMS to be identified belongs that are the same as static resources in the static resource list.
Step S35, taking the HTTP return header content, the HTTP content keywords, the crawler protocol file content, the path features and the static resource loading features as black-box features.
Steps S32 to S35 are a specific implementation of step S12 in example 1.
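The two count-based black-box features of steps S33 and S34 can be illustrated with the sketch below. The sketch rests on several assumptions: "successfully accessed" is interpreted as an HTTP 200 status, the HTTP client is abstracted as a caller-supplied `fetch_status` function so that no particular library is implied, and all function names are hypothetical.

```python
from urllib.parse import urljoin


def candidate_urls(base_url, tree_paths):
    """Build one candidate URL per relative path taken from the path tree."""
    base = base_url.rstrip("/") + "/"
    return [urljoin(base, p.lstrip("/")) for p in tree_paths]


def path_feature(base_url, tree_paths, fetch_status):
    """Count candidate URLs that can be accessed successfully.

    `fetch_status` is a caller-supplied function mapping a URL to an HTTP
    status code; treating 200 as "success" is an assumption of this sketch.
    """
    return sum(
        1 for url in candidate_urls(base_url, tree_paths)
        if fetch_status(url) == 200
    )


def static_loading_feature(loaded_resources, static_resources):
    """Count resources seen in the site's load log that are also in the list."""
    return len(set(loaded_resources) & set(static_resources))
```

Passing the status lookup in as a function keeps the counting logic testable without network access; in practice it would wrap whatever HTTP client the system uses.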
Step S36, inputting the white-box features and the black-box features into a pre-trained CMS classification model to obtain a classification result list output by the CMS classification model, and taking the classification result list as a first recognition result list.
The classification result list comprises a plurality of classification results, and the CMS classification model is trained in advance using white-box features and black-box features extracted from different types of website CMSs.
Step S37, identifying the CMS to be identified based on a fingerprint database rule matching algorithm to obtain a second identification result.
Step S38, identifying the CMS to be identified based on the first identification result list and the second identification result.
The detailed process of steps S36-S38 can be referred to the related description of steps S13-S15 in embodiment 1, and will not be described herein again.
In another embodiment of the present application, a security vulnerability detection method for a website CMS is provided. Referring to fig. 5, the method may include the following steps:
step S41, extracting static information of a first set type from a source code program of the CMS to be identified, and taking the static information as white box characteristics;
step S42, extracting dynamic information of a second set type from the access information of the website to which the CMS to be identified belongs based on the white-box feature, and taking the dynamic information as a black-box feature;
step S43, inputting the white-box features and the black-box features into a pre-trained CMS classification model to obtain a classification result list output by the CMS classification model, and taking the classification result list as a first recognition result list, wherein the classification result list comprises a plurality of classification results, and the CMS classification model is trained in advance using white-box features and black-box features extracted from different types of website CMSs;
step S44, identifying the CMS to be identified based on a fingerprint database rule matching algorithm to obtain a second identification result;
and step S45, identifying the CMS to be identified based on the first identification result list and the second identification result.
The detailed procedures of steps S41 to S45 can be referred to the related descriptions of steps S11 to S15 in embodiment 1, and are not described herein again.
Step S46, acquiring website asset information based on the identification result of the CMS to be identified.
The website asset information may include, but is not limited to: middleware information, CDN information, IP segment information, and operating system information.
Step S47, performing security vulnerability detection on the CMS to be identified based on the website asset information and a preset vulnerability database.
Specifically, user-interactive security vulnerability detection may be performed on the CMS to be identified based on the website asset information and the preset vulnerability database.
Performing user-interactive security vulnerability detection on the CMS to be identified based on the website asset information and the preset vulnerability database can be understood as follows:
the website asset information is displayed as Web page content, and security vulnerability detection is performed on the CMS to be identified using a self-service vulnerability detection function that is provided through ttyd, supports Web-side interaction, and includes the preset vulnerability database.
ttyd can be understood as a tool that shares a command-line terminal to a browser through a Web service. It has the following characteristics:
1. built on the Libwebsockets library in C, with high performance;
2. a fully featured terminal based on Xterm.js, with CJK and IME support;
3. SSL support based on OpenSSL;
4. any command can be run with options;
5. support for authentication and other custom options;
6. cross-platform compatibility: macOS, Linux, FreeBSD/OpenBSD, OpenWrt/LEDE, and Windows.
Performing self-service vulnerability detection in Web form allows the acquired website asset information to be used directly, which facilitates a smooth transition from asset acquisition to security vulnerability detection and improves the operating convenience and experience of security maintenance personnel.
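As a rough illustration of how website asset information and a preset vulnerability database could be combined, the sketch below selects the database entries that apply to an identified CMS and its middleware. The database layout, the entry identifiers (`DEMO-0001` and so on), the CMS and middleware names, and the function `candidate_vulnerabilities` are all invented for this example; the source does not describe the database format. The sketch only narrows down candidate entries and does not perform any actual vulnerability check.

```python
# Hypothetical preset vulnerability database; every entry here is made up.
PRESET_VULN_DB = [
    {"id": "DEMO-0001", "cms": "ExampleCMS", "middleware": None},
    {"id": "DEMO-0002", "cms": "ExampleCMS", "middleware": "ExampleServer"},
    {"id": "DEMO-0003", "cms": "DemoPress", "middleware": None},
]


def candidate_vulnerabilities(cms_name, asset_info, vuln_db=PRESET_VULN_DB):
    """Return IDs of database entries applicable to this CMS and its assets.

    An entry applies when its CMS name matches and it either has no
    middleware requirement or its requirement equals the site's middleware.
    """
    hits = []
    for entry in vuln_db:
        if entry["cms"] != cms_name:
            continue
        required = entry["middleware"]
        if required is None or required == asset_info.get("middleware"):
            hits.append(entry["id"])
    return hits
```

A more accurate CMS identification result narrows this candidate set, which is the benefit the description attributes to identifying the CMS before detection.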
In the application, the CMS to be identified is identified by combining the CMS classification model and the fingerprint database rule matching algorithm, so that the identification accuracy of the CMS can be improved. Moreover, the CMS classification model is obtained based on multiple feature (namely white box features and black box features) training, so that the CMS classification model is trained more fully, the accuracy of the CMS classification model in classification is ensured, and the accuracy of the CMS identification is further improved on the basis.
On the basis of improving the accuracy of the CMS identification result, website asset information is acquired based on the identification result of the CMS to be identified, the accuracy of acquiring the website asset information can be improved, and the accuracy of detecting security vulnerabilities of the CMS can be improved.
Next, the website CMS identification apparatus provided in the present application will be described. The website CMS identification apparatus described below and the website CMS identification method described above may be read with reference to each other.
Referring to fig. 6, the website CMS identifying apparatus includes: a first extraction module 11, a second extraction module 12, a first recognition module 13, a second recognition module 14 and a third recognition module 15.
The first extraction module 11 is configured to extract static information of a first setting type from a source code program of the CMS to be identified, and use the static information as a white-box feature.
The second extraction module 12 is configured to extract dynamic information of a second setting type from the access information of the website to which the CMS to be identified belongs, and use the dynamic information as a black-box feature.
The first recognition module 13 is configured to input the white-box features and the black-box features into a pre-trained CMS classification model, obtain a classification result list output by the CMS classification model, and use the classification result list as a first recognition result list, where the classification result list includes a plurality of classification results, and the CMS classification model is trained in advance using white-box features and black-box features extracted from different types of website CMSs.
The second identification module 14 is configured to identify the CMS to be identified based on a fingerprint database rule matching algorithm, so as to obtain a second identification result.
The third identification module 15 is configured to identify the CMS to be identified based on the first recognition result list and the second identification result.
In this embodiment, the second identifying module 14 may be specifically configured to:
acquiring information representing the identity of the website to be identified;
judging whether CMS type information matched with the information representing the identity of the website to be identified exists in pre-constructed fingerprint library information or not;
and if so, taking the CMS type information matched with the information representing the identity of the website to be identified as a second identification result.
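The fingerprint-library matching performed by the second identification module can be sketched as a simple rule lookup. The rule format, the library contents (`ExampleCMS`, `DemoPress`) and the function `match_fingerprint` are hypothetical and serve only to illustrate the "match, and if so return the CMS type" logic; here the information representing the website's identity is assumed to be its HTTP response headers and page body.

```python
# Hypothetical pre-constructed fingerprint library; rules are invented.
FINGERPRINT_LIBRARY = [
    {"cms": "ExampleCMS", "header": "X-Powered-By", "contains": "ExampleCMS"},
    {"cms": "DemoPress", "body_keyword": "demo-press-content"},
]


def match_fingerprint(headers, body, library=FINGERPRINT_LIBRARY):
    """Return the CMS type of the first matching rule, or None if none match."""
    for rule in library:
        if "header" in rule:
            value = headers.get(rule["header"], "")
            if rule["contains"].lower() in value.lower():
                return rule["cms"]
        if "body_keyword" in rule and rule["body_keyword"] in body:
            return rule["cms"]
    return None
```

Returning `None` when nothing matches lets the later fusion step fall back on the classification model's result list alone.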
The third identifying module 15 may specifically be configured to:
determining decision weights of all classification results in the first recognition result list and decision weights of the second recognition results;
under the condition that each classification result in the first recognition result list is different from the second recognition result, taking a result corresponding to the maximum value in the decision weight as a recognition result of the CMS to be recognized;
adding the decision weight of the classification result in the first recognition result list, which is the same as the second recognition result, to the decision weight of the second recognition result when the classification result in the first recognition result list is the same as the second recognition result, and taking the added decision weight as a target decision weight;
and taking the result corresponding to the maximum value in the target decision weight and the decision weight of the classification result different from the second identification result in the first identification result list as the identification result of the CMS to be identified.
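The weight-based decision of the third identification module can be expressed compactly as in the sketch below. The function name `fuse_results` and the representation of the first recognition result list as `(cms_name, weight)` pairs are assumptions; how the decision weights themselves are determined is not covered here.

```python
def fuse_results(first_list, second_result, second_weight):
    """Combine classifier results with the fingerprint result.

    first_list:    list of (cms_name, decision_weight) pairs from the model,
                   assumed to contain each CMS name at most once.
    second_result: CMS name from fingerprint matching, or None if no match.
    second_weight: decision weight assigned to the fingerprint result.
    """
    scores = dict(first_list)
    if second_result is not None:
        # If a classifier result equals the fingerprint result, the two
        # weights are added (the "target decision weight"); otherwise the
        # fingerprint result simply competes with its own weight.
        scores[second_result] = scores.get(second_result, 0.0) + second_weight
    if not scores:
        return None
    # The result with the largest (possibly summed) weight wins.
    return max(scores, key=scores.get)
```

For instance, with classifier results A (0.5) and B (0.3) and a fingerprint result B with weight 0.3, the summed weight of B is 0.6, so B is chosen over A.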
In this embodiment, the static information of the first setting type may include:
a target path tree and a static resource list;
the second extraction module 12 may be specifically configured to:
extracting HTTP return header content, HTTP content keywords and crawler protocol file content from the access information of the website to which the CMS to be identified belongs;
extracting path features from the access information of the website to which the CMS to be identified belongs based on the target path tree, wherein the path feature is the number of first URLs, and a first URL is a URL that is generated based on the target path tree and through which the website to be identified can be successfully accessed;
extracting static resource loading features from the access information of the website to which the CMS to be identified belongs based on the static resource list, wherein the static resource loading feature is the number of static resources in the static resource log of the access information of the website to which the CMS to be identified belongs that are the same as static resources in the static resource list;
and taking the HTTP return header content, the HTTP content keywords, the crawler protocol file content, the path characteristics and the static resource loading characteristics as black box characteristics.
Next, a security vulnerability detection apparatus of the website CMS provided by the present application will be described, and the security vulnerability detection apparatus of the website CMS described below and the security vulnerability detection method of the website CMS described above may be referred to each other.
Referring to fig. 7, the security vulnerability detection apparatus of the website CMS includes: a first extraction module 21, a second extraction module 22, a first identification module 23, a second identification module 24, a third identification module 25, an acquisition module 26 and a detection module 27.
The first extraction module 21 is configured to extract static information of a first setting type from the source code program of the CMS to be identified, and use the static information as a white-box feature.
The second extraction module 22 is configured to extract dynamic information of a second setting type from the access information of the website to which the CMS to be identified belongs, and use the dynamic information as a black-box feature.
The first recognition module 23 is configured to input the white-box features and the black-box features into a pre-trained CMS classification model, obtain a classification result list output by the CMS classification model, and use the classification result list as a first recognition result list, where the classification result list includes a plurality of classification results, and the CMS classification model is trained in advance using white-box features and black-box features extracted from different types of website CMSs.
The second identification module 24 is configured to identify the CMS to be identified based on a fingerprint database rule matching algorithm, so as to obtain a second identification result.
The third identification module 25 is configured to identify the CMS to be identified based on the first recognition result list and the second identification result.
For the first extraction module 21, the second extraction module 22, the first identification module 23, the second identification module 24, and the third identification module 25, reference may be made to the related descriptions of the first extraction module 11, the second extraction module 12, the first identification module 13, the second identification module 14, and the third identification module 15, which are not described herein again.
The acquisition module 26 is configured to acquire website asset information based on the identification result of the CMS to be identified.
The detection module 27 is configured to perform security vulnerability detection on the CMS to be identified based on the website asset information and a preset vulnerability database.
It should be noted that the description of each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. Since the device embodiments are basically similar to the method embodiments, they are described briefly, and reference may be made to the corresponding parts of the method embodiments for relevant points.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, respectively. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The website CMS identification method, the security vulnerability detection method, and the corresponding devices provided by the present application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present application, and the description of the embodiments is only intended to help in understanding the method and core idea of the present application. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the scope of application according to the idea of the present application. In summary, the content of this specification should not be construed as a limitation of the present application.

Claims (8)

1. A method for CMS identification of a website, the method comprising:
extracting static information of a first setting type from a source code program of a CMS to be identified, wherein the static information is used as a white-box feature, and the static information of the first setting type comprises the following steps: a target path tree and a static resource list;
extracting dynamic information of a second set type from the access information of the website to which the CMS to be identified belongs, and taking the dynamic information as black box characteristics; the extracting of the dynamic information of the second setting type from the access information of the website to which the CMS to be identified belongs, taking the dynamic information as a black box feature, includes: extracting HTTP return header content, HTTP content keywords and crawler protocol file content from the access information of the website to which the CMS to be identified belongs; extracting path features from the access information of the website to which the CMS to be identified belongs based on the target path tree, wherein the path features are the number of first URLs, the first URLs are generated based on the target path tree, and the website to which the CMS to be identified belongs can be successfully accessed based on the first URLs; extracting static resource loading characteristics from the access information of the website to which the CMS to be identified belongs based on the static resource list, wherein the static resource loading characteristics are the number of the same static resources in a static resource log in the access information of the website to which the CMS to be identified belongs as in the static resource list; taking the HTTP return header content, the HTTP content keywords, the crawler protocol file content, the path characteristics and the static resource loading characteristics as black box characteristics;
inputting the white-box features and the black-box features into a pre-trained CMS classification model to obtain a classification result list output by the CMS classification model, wherein the classification result list is used as a first recognition result list and comprises a plurality of classification results, and the CMS classification model is obtained by utilizing white-box features and black-box features extracted by CMSs based on different types of websites in advance through training;
identifying the CMS to be identified based on a fingerprint database rule matching algorithm to obtain a second identification result;
and identifying the CMS to be identified based on the first identification result list and the second identification result.
2. The method of claim 1, wherein the identifying the CMS to be identified based on the rule matching algorithm to obtain a second identification result, comprising:
acquiring information representing the identity of the website to which the CMS to be identified belongs;
judging whether CMS type information matched with the information representing the identity of the website to which the CMS to be identified belongs exists in pre-constructed fingerprint library information;
and if so, using the CMS type information matched with the information representing the identity of the website to which the CMS to be identified belongs as a second identification result.
3. The method of claim 1, wherein the identifying the CMS to be identified based on the first list of identification results and the second identification result comprises:
determining decision weights of all classification results in the first recognition result list and decision weights of the second recognition results;
under the condition that each classification result in the first recognition result list is different from the second recognition result, taking a result corresponding to the maximum value in the decision weight as a recognition result of the CMS to be recognized;
adding the decision weight of the classification result in the first recognition result list, which is the same as the second recognition result, to the decision weight of the second recognition result when the classification result in the first recognition result list is the same as the second recognition result, and taking the added decision weight as a target decision weight;
and taking the result corresponding to the maximum value in the target decision weight and the decision weight of the classification result different from the second identification result in the first identification result list as the identification result of the CMS to be identified.
4. A security vulnerability detection method of a website CMS is characterized by comprising the following steps:
extracting static information of a first setting type from a source code program of a CMS to be identified, and using the static information as a white-box feature, wherein the static information of the first setting type comprises the following steps: a target path tree and a static resource list;
extracting dynamic information of a second set type from the access information of the website to which the CMS to be identified belongs on the basis of the white-box feature, and taking the dynamic information as a black-box feature; the extracting of the dynamic information of the second setting type from the access information of the website to which the CMS to be identified belongs, taking the dynamic information as a black box feature, includes: extracting HTTP return header content, HTTP content keywords and crawler protocol file content from the access information of the website to which the CMS to be identified belongs; extracting path features from the access information of the website to which the CMS to be identified belongs based on the target path tree, wherein the path features are the number of first URLs, the first URLs are generated based on the target path tree, and the website to which the CMS to be identified belongs can be successfully accessed based on the first URLs; extracting static resource loading characteristics from the access information of the website to which the CMS to be identified belongs based on the static resource list, wherein the static resource loading characteristics are the number of the same static resources in a static resource log in the access information of the website to which the CMS to be identified belongs as in the static resource list; taking the HTTP return header content, the HTTP content keywords, the crawler protocol file content, the path characteristics and the static resource loading characteristics as black box characteristics;
inputting the white-box features and the black-box features into a pre-trained CMS classification model to obtain a classification result list output by the CMS classification model, wherein the classification result list is used as a first recognition result list and comprises a plurality of classification results, and the CMS classification model is obtained by utilizing white-box features and black-box features extracted by CMSs based on different types of websites in advance through training;
identifying the CMS to be identified based on a fingerprint database rule matching algorithm to obtain a second identification result;
identifying the CMS to be identified based on the first identification result list and the second identification result;
acquiring website asset information based on the result of identifying the CMS to be identified;
and performing security vulnerability detection on the CMS to be identified based on the website asset information and a preset vulnerability database.
5. A CMS identification apparatus for a website, comprising:
the first extraction module is configured to extract static information of a first setting type from a source code program of a CMS to be identified, where the static information is used as a white-box feature, and the static information of the first setting type includes: a target path tree and a static resource list;
the second extraction module is used for extracting dynamic information of a second set type from the access information of the website to which the CMS to be identified belongs, and taking the dynamic information as black box characteristics; the second extraction module is specifically configured to: extracting HTTP return header content, HTTP content keywords and crawler protocol file content from the access information of the website to which the CMS to be identified belongs; extracting path features from the access information of the website to which the CMS to be identified belongs based on the target path tree, wherein the path features are the number of first URLs, the first URLs are generated based on the target path tree, and the website to which the CMS to be identified belongs can be successfully accessed based on the first URLs; extracting static resource loading characteristics from the access information of the website to which the CMS to be identified belongs based on the static resource list, wherein the static resource loading characteristics are the number of the same static resources in a static resource log in the access information of the website to which the CMS to be identified belongs as in the static resource list; taking the HTTP return header content, the HTTP content keywords, the crawler protocol file content, the path characteristics and the static resource loading characteristics as black box characteristics;
the first recognition module is used for inputting the white-box features and the black-box features into a pre-trained CMS classification model to obtain a classification result list output by the CMS classification model, and using the classification result list as a first recognition result list, wherein the classification result list comprises a plurality of classification results, and the CMS classification model is obtained by pre-training the white-box features and the black-box features extracted by CMS based on different types of websites;
the second identification module is used for identifying the CMS to be identified based on a fingerprint database rule matching algorithm to obtain a second identification result;
and the third identification module is used for identifying the CMS to be identified based on the first identification result list and the second identification result.
6. The apparatus of claim 5, wherein the second identification module is specifically configured to:
acquiring information representing the identity of the website to which the CMS to be identified belongs;
judging whether CMS type information matched with the information representing the identity of the website to which the CMS to be identified belongs exists in pre-constructed fingerprint library information;
and if so, taking the CMS type information matched with the information representing the identity of the website to which the CMS to be identified belongs as a second identification result.
7. The apparatus according to claim 5, wherein the third identifying module is specifically configured to:
determining decision weights of all classification results in the first recognition result list and decision weights of the second recognition results;
under the condition that each classification result in the first recognition result list is different from the second recognition result, taking a result corresponding to the maximum value in the decision weight as a recognition result of the CMS to be recognized;
adding the decision weight of the classification result in the first recognition result list, which is the same as the second recognition result, to the decision weight of the second recognition result when the classification result in the first recognition result list is the same as the second recognition result, and taking the added decision weight as a target decision weight;
and taking the result corresponding to the maximum value in the target decision weight and the decision weight of the classification result different from the second identification result in the first identification result list as the identification result of the CMS to be identified.
8. A security vulnerability detection apparatus of a website CMS, comprising:
the first extraction module is configured to extract static information of a first setting type from a source code program of a CMS to be identified, where the static information is used as a white-box feature, and the static information of the first setting type includes: a target path tree and a static resource list;
the second extraction module is used for extracting dynamic information of a second set type from the access information of the website to which the CMS to be identified belongs, and taking the dynamic information as black box characteristics; the second extraction module is specifically configured to: extracting HTTP return header content, HTTP content keywords and crawler protocol file content from the access information of the website to which the CMS to be identified belongs; extracting path features from the access information of the website to which the CMS to be identified belongs based on the target path tree, wherein the path features are the number of first URLs, the first URLs are generated based on the target path tree, and the website to which the CMS to be identified belongs can be successfully accessed based on the first URLs; extracting static resource loading characteristics from the access information of the website to which the CMS to be identified belongs based on the static resource list, wherein the static resource loading characteristics are the number of the same static resources in a static resource log in the access information of the website to which the CMS to be identified belongs as in the static resource list; taking the HTTP return header content, the HTTP content keywords, the crawler protocol file content, the path characteristics and the static resource loading characteristics as black box characteristics;
the first identification module is used for inputting the white-box features and the black-box features into a pre-trained CMS classification model to obtain a classification result list output by the CMS classification model, and using the classification result list as a first identification result list, wherein the classification result list comprises a plurality of classification results, and the CMS classification model is obtained by pre-training on white-box features and black-box features extracted from the CMSs of different types of websites;
the second identification module is used for identifying the CMS to be identified based on a fingerprint database rule matching algorithm to obtain a second identification result;
the third identification module is used for identifying the CMS to be identified based on the first identification result list and the second identification result;
the acquisition module is used for acquiring website asset information based on the identification result of the CMS to be identified;
and the detection module is used for detecting security vulnerabilities of the CMS to be identified based on the website asset information and a preset vulnerability database.
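
The two counted black-box features described in the claim above (the path feature and the static resource loading feature) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the nested-dict representation of the target path tree, the function names, and the `is_accessible` callback (standing in for a real HTTP probe) are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List


def generate_urls(base_url: str, path_tree: Dict[str, dict]) -> List[str]:
    """Flatten a target path tree into candidate URLs.

    Assumption: the tree is a nested dict mapping a path segment to its
    child segments; the claim does not specify a concrete representation.
    """
    root = base_url.rstrip("/")
    urls: List[str] = []

    def walk(node: Dict[str, dict], prefix: str) -> None:
        for segment, children in node.items():
            path = f"{prefix}/{segment}"
            urls.append(root + path)
            if children:
                walk(children, path)

    walk(path_tree, "")
    return urls


def path_feature(
    base_url: str,
    path_tree: Dict[str, dict],
    is_accessible: Callable[[str], bool],
) -> int:
    """Count the URLs generated from the path tree that can be accessed.

    `is_accessible` is a hypothetical probe supplied by the caller.
    """
    return sum(1 for url in generate_urls(base_url, path_tree) if is_accessible(url))


def static_resource_feature(
    resource_log: Iterable[str], resource_list: Iterable[str]
) -> int:
    """Count distinct static resources in the log that also appear in the list."""
    return len(set(resource_log) & set(resource_list))
```

In practice `is_accessible` might issue an HTTP request and treat a successful status code as accessible; that choice is likewise an assumption, since the claim only requires that the website "can be successfully accessed" through the URL. Duplicated log entries are counted once here, which is one reasonable reading of "the number of the same static resources".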
CN202010534459.5A 2020-06-12 2020-06-12 Website CMS (content management system) identification method and security vulnerability detection method and device Active CN111695075B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010534459.5A CN111695075B (en) 2020-06-12 2020-06-12 Website CMS (content management system) identification method and security vulnerability detection method and device

Publications (2)

Publication Number Publication Date
CN111695075A CN111695075A (en) 2020-09-22
CN111695075B CN111695075B (en) 2023-04-18

Family

ID=72480762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010534459.5A Active CN111695075B (en) 2020-06-12 2020-06-12 Website CMS (content management system) identification method and security vulnerability detection method and device

Country Status (1)

Country Link
CN (1) CN111695075B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420818A (en) * 2021-06-27 2021-09-21 杭州迪普科技股份有限公司 Content management system identification method and device
CN115277396B (en) * 2022-08-04 2024-03-26 北京智慧星光信息技术有限公司 Message driving method and system for simulating browser operation
CN116991978B (en) * 2023-09-26 2024-01-02 杭州今元标矩科技有限公司 CMS (content management system) fragment feature extraction method, system, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9032531B1 (en) * 2012-06-28 2015-05-12 Middlegate, Inc. Identification breach detection
CN109886022A (en) * 2019-02-20 2019-06-14 北京丁牛科技有限公司 CMS kind identification method and device
CN110825941A (en) * 2019-10-17 2020-02-21 北京天融信网络安全技术有限公司 Content management system identification method, device and storage medium
CN111177618A (en) * 2019-12-17 2020-05-19 腾讯科技(深圳)有限公司 Website building method, device, equipment and computer readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150213389A1 (en) * 2014-01-29 2015-07-30 Adobe Systems Incorporated Determining and analyzing key performance indicators
US11537272B2 (en) * 2016-12-21 2022-12-27 Aon Global Operations Se, Singapore Branch Content management system extensions

Also Published As

Publication number Publication date
CN111695075A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN111695075B (en) Website CMS (content management system) identification method and security vulnerability detection method and device
CN106599160B (en) Content rule library management system and coding method thereof
CN108090351B (en) Method and apparatus for processing request message
CN111176996A (en) Test case generation method and device, computer equipment and storage medium
US10303689B2 (en) Answering natural language table queries through semantic table representation
US9311062B2 (en) Consolidating and reusing portal information
CN112749284B (en) Knowledge graph construction method, device, equipment and storage medium
CN112688804B (en) Service platform deployment method, device, equipment and storage medium
Roy Choudhary et al. Cross-platform feature matching for web applications
CN107786529B (en) Website detection method, device and system
CN111597490A (en) Web fingerprint identification method, device, equipment and computer storage medium
CN113918794B (en) Enterprise network public opinion benefit analysis method, system, electronic equipment and storage medium
CN114285641A (en) Network attack detection method and device, electronic equipment and storage medium
CN114006749A (en) Security verification method, device, equipment and storage medium
CN111177600B (en) Built-in webpage loading method and device based on mobile application
CN115801455B (en) Method and device for detecting counterfeit website based on website fingerprint
CN109684844B (en) Webshell detection method and device, computing equipment and computer-readable storage medium
US20160342500A1 (en) Template Identification for Control of Testing
CN111400623B (en) Method and device for searching information
CN114372265A (en) Malicious program detection method and device, electronic equipment and storage medium
CN114500033B (en) Method, device, computer equipment and medium for verifying application server
CN110134377B (en) Data request processing method, device and equipment of power industry management information system
CN116361793A (en) Code detection method, device, electronic equipment and storage medium
CN117827814A (en) Data verification method, device, computer equipment and storage medium
CN115292634A (en) Website application identification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant