CN110825941A - Content management system identification method, device and storage medium - Google Patents

Content management system identification method, device and storage medium

Publication number
CN110825941A
Authority
CN
China
Prior art keywords
webpage
clustering
information
sample
cms
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910986648.3A
Other languages
Chinese (zh)
Inventor
潘季明
贾蓉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd filed Critical Beijing Topsec Technology Co Ltd
Priority to CN201910986648.3A priority Critical patent/CN110825941A/en
Publication of CN110825941A publication Critical patent/CN110825941A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a content management system identification method, a content management system identification device and a storage medium, which are used for improving the effectiveness of a content management system identification result. The content management system identification method comprises the following steps: acquiring webpage feature information from a predicted webpage, wherein the webpage feature information comprises at least one of the following items: webpage response header and response body information, static resource information, webpage directory structure information, webpage association information, webpage error page information, webpage footer technical support information, and a reading guide file; extracting preset keywords from the webpage feature information corresponding to the predicted webpage and converting the preset keywords into corresponding feature vectors; and determining the webpage type corresponding to the predicted webpage by using a CMS clustering model based on the feature vectors, wherein the CMS clustering model is obtained by performing clustering iteration on webpage feature information extracted from webpage samples.

Description

Content management system identification method, device and storage medium
Technical Field
The present invention relates to the field of network security technologies, and in particular, to a content management system identification method, device, and storage medium.
Background
A Content Management System (CMS) can, based on a modular design concept, produce comprehensive or specialized websites such as news websites, social blogs, animation and game sites, and video sites in a short period. Because CMS versions change frequently and new systems emerge rapidly, the variety of CMSs used for enterprise or personal website building keeps growing. CMS identification is a very important stage in the penetration test process and a key step in information collection. Identifying the CMS of a website plays a crucial role in guaranteeing the security of the website.
In the prior art, a common CMS identification method is fingerprint identification. A fingerprint is a piece of characteristic information on components such as a Web application program or a database that can identify the type of an object, and it can be used for quickly identifying a target service. The method determines the CMS type of a website by checking the MD5 (Message-Digest Algorithm 5) values of files unique to the website, such as favicon, css (Cascading Style Sheets), logo.ico and js (JavaScript) files, or by checking the robots.txt file. Unique files in the CMS public code are collected, the target website is crawled by web crawlers (also known as spiders), and the MD5 values of the crawled files are compared with an existing fingerprint library; if they are the same, the system is considered to match the corresponding CMS. robots.txt is the first file viewed when a search engine accesses a website, and it tells spiders which files on the server may be viewed. In most cases, accessing the robots.txt file is enough to directly determine the CMS type of the website. For example: when the robots.txt file of a certain web page contains "# robots.txt for PageAdmin CMS", the CMS type of the web page can be determined to be PageAdmin.
However, since most fingerprints are based on the MD5 value or file path of a file and a specific regular expression, the probability of fingerprint failure is high for CMS applications that are updated frequently, which reduces the effectiveness of CMS identification results. In addition, most secondary developers delete or mask some fingerprint features specific to the CMS, such as changing the directory structure of the original static files, deleting the original CMS name strings, or removing robots.txt.
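As an illustration of the prior-art fingerprint approach described above, the following is a minimal Python sketch, not a real tool. The fingerprint library contents, the function names and the exact comment pattern matched in robots.txt are assumptions made for this example only.

```python
import hashlib

# Hypothetical fingerprint library: MD5 hex digest of a CMS-specific file -> CMS name.
FINGERPRINTS = {
    hashlib.md5(b"example favicon bytes").hexdigest(): "ExampleCMS",
}

def identify_by_md5(file_bytes, fingerprints=FINGERPRINTS):
    """Return the CMS whose stored fingerprint equals the file's MD5, or None."""
    return fingerprints.get(hashlib.md5(file_bytes).hexdigest())

def identify_by_robots(robots_txt):
    """Look for a comment line of the assumed form '# robots.txt for <name> CMS'."""
    prefix, suffix = "# robots.txt for ", " cms"
    for line in robots_txt.splitlines():
        line = line.strip()
        lowered = line.lower()
        if lowered.startswith(prefix) and lowered.endswith(suffix):
            return line[len(prefix):-len(suffix)].strip()
    return None
```

The sketch also shows why such fingerprints are fragile: changing a single byte of the file changes its MD5 value, and deleting the comment line removes the robots.txt marker.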
Disclosure of Invention
The embodiment of the invention provides a content management system identification method, a content management system identification device and a storage medium, which are used for improving the effectiveness of identification results of a content management system.
In a first aspect, a content management system identification method is provided, including:
acquiring webpage feature information from a predicted webpage, wherein the webpage feature information comprises at least one of the following items: webpage response header and response body information, static resource information, webpage directory structure information, webpage association information, webpage error page information, webpage footer technical support information, and a reading guide file;
extracting preset keywords from the webpage characteristic information corresponding to the predicted webpage and converting the preset keywords into corresponding characteristic vectors;
and determining the webpage type corresponding to the predicted webpage by utilizing a CMS clustering model based on the feature vector, wherein the CMS clustering model is obtained by utilizing webpage feature information extracted from a webpage sample to perform clustering iteration.
In one embodiment, the CMS clustering model is obtained by performing clustering iteration according to the following procedure using web page feature information extracted from a web page sample:
acquiring webpage characteristic information from each webpage sample;
respectively extracting preset keywords from the webpage characteristic information corresponding to each webpage sample and converting the preset keywords into corresponding characteristic vectors;
and based on the feature vectors corresponding to the webpage samples, performing clustering iteration by using a clustering algorithm according to preset clustering parameters to obtain a CMS (content management system) clustering model.
In one embodiment, the clustering algorithm comprises a fast density clustering algorithm NQ-DBSCAN based on a proximity search technique.
In one embodiment, the clustering parameters include: a neighborhood distance threshold of a webpage sample, and a threshold on the number of samples contained in the neighborhood within that distance of the webpage sample.
In one embodiment, after extracting the preset keywords from the web page feature information corresponding to each web page sample, before converting the preset keywords into the corresponding feature vectors, the method further includes:
and performing dimensionality reduction on the extracted preset keywords by using a preset dimensionality reduction algorithm.
In a second aspect, there is provided a content management system identification apparatus, including:
an obtaining unit, configured to obtain web page feature information from a predicted web page, where the web page feature information includes at least one of: webpage response header and response body information, static resource information, webpage directory structure information, webpage association information, webpage error page information, webpage footer technical support information, and a reading guide file;
the extraction and conversion unit is used for extracting preset keywords from the webpage characteristic information corresponding to the predicted webpage and converting the preset keywords into corresponding characteristic vectors;
and the determining unit is used for determining the webpage type corresponding to the predicted webpage by utilizing a CMS clustering model based on the feature vector, wherein the CMS clustering model is obtained by utilizing webpage feature information extracted from a webpage sample to perform clustering iteration.
In an embodiment, the content management system identification apparatus provided in the embodiment of the present invention further includes a clustering unit, where:
the acquiring unit is further configured to acquire, for each web page sample, web page feature information from the web page sample;
the extraction and conversion unit is further used for respectively extracting preset keywords from the webpage feature information corresponding to each webpage sample and converting the preset keywords into corresponding feature vectors;
and the clustering unit is used for performing clustering iteration by using a clustering algorithm according to preset clustering parameters based on the characteristic vectors corresponding to the webpage samples to obtain a CMS (content management system) clustering model.
In one embodiment, the clustering algorithm comprises a fast density clustering algorithm NQ-DBSCAN based on a proximity search technique.
In one embodiment, the clustering parameters include: a neighborhood distance threshold of a webpage sample, and a threshold on the number of samples contained in the neighborhood within that distance of the webpage sample.
In an embodiment, the determining unit is further configured to determine, after each clustering iteration is finished, a contour coefficient corresponding to the current clustering, where the contour coefficient includes a cohesion degree and a separation degree; determine the difference between the contour coefficient corresponding to the current clustering and the contour coefficient corresponding to the last clustering; if the difference is not larger than a preset difference threshold, stop iteration; otherwise, continue clustering iteration until the difference is not greater than the preset difference threshold.
In an embodiment, an apparatus for identifying a content management system provided in an embodiment of the present invention further includes:
and the dimension reduction unit is used for performing dimension reduction processing on the extracted preset keywords by using a preset dimension reduction algorithm after the extraction and conversion unit respectively extracts the preset keywords from the webpage feature information corresponding to each webpage sample and before the preset keywords are converted into corresponding feature vectors.
By adopting the technical scheme, the invention at least has the following advantages:
According to the content management system identification method, the content management system identification device and the storage medium, the CMS clustering model is obtained by performing clustering iteration on webpage feature information extracted from webpage samples, and the webpage type corresponding to the predicted webpage is determined based on the obtained CMS clustering model and the feature vector converted from the webpage feature information obtained from the predicted webpage. In this process, the webpage type is determined by analyzing features of multiple dimensions, such as webpage response header and response body information, static resource information, webpage directory structure information, webpage association information, webpage error page information, webpage footer technical support information and reading guide files, and the original features of the CMS are fundamentally preserved, so that the identification effectiveness and the identification efficiency of the CMS are improved.
Drawings
FIG. 1 is a CMS clustering model clustering iteration flow chart according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of an implementation of the content management system identification method according to the embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an identification device of the content management system according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a computing device according to an embodiment of the invention.
Detailed Description
To further explain the technical means and effects of the present invention adopted to achieve the intended purpose, the present invention will be described in detail with reference to the accompanying drawings and preferred embodiments.
It should be noted that the terms "first", "second", and the like in the description and the claims of the embodiments of the present invention and in the drawings described above are used for distinguishing similar objects and not necessarily for describing a particular order or sequence. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein.
Reference herein to "a plurality" or "a number" means two or more. "And/or" describes the association relationship of the associated objects and means that three relationships are possible; e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
In the embodiment of the present invention, in order to improve the effectiveness of CMS identification, original features of a web page are extracted from the web page, and a CMS clustering model is obtained by performing clustering iteration using the extracted original features. The clustering iteration process of the CMS clustering model is described below with reference to FIG. 1. As shown in FIG. 1, the process may include the following steps:
S11, acquiring webpage feature information from each webpage sample.
In this step, at least the following web page feature information may be extracted from the web page:
1. Web page response header and response body information.
2. Static resource information, such as js, css, and icon information.
3. Web page directory structure information.
4. Web page association information, such as web server information, server operating system information, programming language information to which the web application belongs, database information, and website type information (e.g., shopping, forums, enterprises, blogs, etc.).
5. Web page error page information, namely the 404 page information of the website.
6. Web page footer technical support information.
7. robots.txt, a reading guide (readme) file, and the like.
S12, respectively extracting preset keywords from the webpage feature information corresponding to each webpage sample and converting the preset keywords into corresponding feature vectors.
In this step, the following key field contents may be extracted from the web page feature information acquired in step S11:
1) Tokenize the comments and filter out invalid words.
2) Extract the js function names, and compute type statistics and frequency statistics.
3) Extract the css framework structure content.
4) Extract web page icon information.
5) Extract the web page directory structure.
6) Extract web server information.
7) Extract server operating system information.
8) Extract the programming language to which the web application belongs.
9) Extract database information.
10) Acquire website type information (such as blogs, forums and the like) according to the web page content.
11) Extract the 404 page information of the website.
12) Extract the effective information of the readme file.
After the keywords are extracted, they are converted into corresponding feature vectors.
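The conversion of extracted keywords into a feature vector can be sketched as a simple keyword count over a preset vocabulary. This is only one possible encoding; the embodiment does not fix a particular one. The vocabulary `VOCAB`, the tokenization rule and the function name below are hypothetical.

```python
import re
from collections import Counter

# Hypothetical preset keyword vocabulary; a real one would be derived from the samples.
VOCAB = ["wp-content", "jquery", "nginx", "php", "mysql", "blog"]

def to_feature_vector(feature_text, vocab=VOCAB):
    """Count how often each preset keyword occurs in the collected feature text."""
    # Tokens are runs of letters, digits, underscores and hyphens (an assumed rule).
    tokens = re.findall(r"[a-z0-9_\-]+", feature_text.lower())
    counts = Counter(tokens)
    return [float(counts[keyword]) for keyword in vocab]
```

Each web page sample then yields one vector of fixed length `len(VOCAB)`, which is the form a clustering algorithm expects.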
S13, based on the feature vectors corresponding to the webpage samples, performing clustering iteration by using a clustering algorithm according to preset clustering parameters to obtain a CMS clustering model.
In specific implementation, the embodiment of the present invention is not limited to the selection of the clustering algorithm.
In one embodiment, the clustering algorithm may use a fast density clustering algorithm based on a proximity search technique (NQ-DBSCAN). The NQ-DBSCAN clustering algorithm uses the idea of neighbor search and accelerates clustering by directly marking some of the data points that meet certain conditions as outliers or core points. The NQ-DBSCAN clustering algorithm describes the compactness of a sample set based on a group of neighborhoods, and uses the clustering parameters (ε, MinPts) to describe how tightly the samples in a neighborhood are distributed. Here, ε is the neighborhood distance threshold of a given webpage sample, and MinPts is the threshold on the number of samples in the neighborhood within distance ε of a given sample.
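To make the roles of the two clustering parameters concrete, the following is a minimal sketch of plain DBSCAN in Python. It is not NQ-DBSCAN: the neighbor-search acceleration described above is omitted, and every neighborhood is found by brute force. The function names and the use of Euclidean distance are assumptions of this example.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Plain DBSCAN. Returns one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE            # may later be claimed as a border point
            continue
        cluster += 1                     # points[i] is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(j_neighbors)
    return labels
```

A sample is a core point when at least `min_pts` samples (itself included) lie within `eps` of it; samples that belong to no cluster keep the label -1.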
In specific implementation, the distance between web page samples may be measured as a Euclidean distance, a Manhattan distance, a Chebyshev distance, or the like, which is not limited in the embodiment of the present invention.
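For reference, the three distance measures named above can be written as follows for two feature vectors of equal length. The function names are chosen for this sketch only.

```python
def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))
```

For the vectors (0, 0) and (3, 4), these give 5.0, 7 and 4 respectively, so the same neighborhood distance threshold selects different neighborhoods depending on the measure.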
In the embodiment of the invention, the contour coefficient (Silhouette Coefficient) is selected as a validity index to judge whether clustering iteration should stop. The contour coefficient combines the cohesion (Cohesion) and the separation (Separation) of the clustering; its value lies between -1 and 1, and the larger the value, the better the clustering effect.
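The contour (silhouette) coefficient can be sketched as follows, using the standard definition: for each sample, a is the mean distance to the other samples of its own cluster (cohesion), b is the smallest mean distance to the samples of any other cluster (separation), and the sample's score is (b - a) / max(a, b). Ignoring noise samples (label -1), returning 0.0 when fewer than two clusters exist, and using Euclidean distance are assumptions of this sketch.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over clustered points; label -1 (noise) is ignored."""
    clusters = {}
    for p, label in zip(points, labels):
        if label != -1:
            clusters.setdefault(label, []).append(p)
    if len(clusters) < 2:
        return 0.0                       # assumed convention: undefined case scores 0
    scores = []
    for label, members in clusters.items():
        for p in members:
            if len(members) == 1:
                scores.append(0.0)       # a singleton cluster contributes 0
                continue
            # Cohesion: mean distance to the other members of the same cluster.
            a = sum(math.dist(p, q) for q in members) / (len(members) - 1)
            # Separation: mean distance to the nearest other cluster.
            b = min(sum(math.dist(p, q) for q in other) / len(other)
                    for m, other in clusters.items() if m != label)
            denom = max(a, b)
            scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Tight, well-separated clusters score close to 1, while a labeling that mixes distant samples into the same cluster scores below 0.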
In specific implementation, the contour coefficient can be used to judge whether to stop clustering iteration according to the following procedure: after each clustering iteration is finished, determining the contour coefficient corresponding to the current clustering; determining the difference between the contour coefficient corresponding to the current clustering and the contour coefficient corresponding to the last clustering; if the difference is not larger than a preset difference threshold, stopping iteration; otherwise, continuing clustering iteration until the difference is not greater than the preset difference threshold.
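The stopping procedure above can be sketched as a small loop. The callables `run_iteration` and `score`, the use of the absolute difference, and the `max_iters` safety cap are assumptions of this example and are not specified by the embodiment.

```python
def iterate_until_stable(run_iteration, score, threshold, max_iters=100):
    """Repeat clustering until the contour coefficient changes by at most threshold.

    run_iteration() performs one clustering pass and returns its result;
    score(result) returns the contour coefficient of that result.
    """
    previous = None
    result = None
    for _ in range(max_iters):           # assumed safety cap on the number of passes
        result = run_iteration()
        current = score(result)
        if previous is not None and abs(current - previous) <= threshold:
            break                        # coefficient has stabilized: stop iterating
        previous = current
    return result
```

With coefficients 0.2, 0.5, 0.58 and 0.585 on successive passes and a threshold of 0.01, the loop stops after the fourth pass, because 0.585 differs from 0.58 by less than the threshold.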
In order to increase the iteration efficiency of the CMS clustering model and improve the data processing speed in the iteration process, in the embodiment of the present invention, after extracting keywords from each piece of web page feature information, a preset dimensionality reduction algorithm, for example, a dimensionality reduction algorithm such as SVD (singular value decomposition), PCA (Principal Component Analysis), or the like, may be used to perform dimensionality reduction processing on the extracted keywords, so as to filter out features with small differences.
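As a rough illustration of PCA-style dimensionality reduction, the sketch below finds the first principal component of a set of numeric rows by power iteration and projects each row onto it. It assumes the keywords have already been encoded as numeric counts, and it extracts a single component only. A practical implementation would normally rely on a numerical library. The function name is hypothetical.

```python
def first_principal_component(rows, iters=100):
    """Return (unit direction, projections) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d                        # assumed start vector for power iteration
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break                        # no variance in the data (or unlucky start)
        v = [x / norm for x in w]
    projected = [sum(c[k] * v[k] for k in range(d)) for c in centered]
    return v, projected
```

Features that barely vary across samples contribute little to the leading component, which is the sense in which such a projection filters out features with small differences.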
After the clustering iteration is finished, a stable CMS clustering model is obtained. Based on the obtained CMS clustering model, FIG. 2 shows an implementation flow diagram of the content management system identification method provided by the embodiment of the present invention, which may include the following steps:
S21, acquiring webpage feature information from the predicted webpage.
In this step, the obtained webpage feature information may include at least one of the following: webpage response header and response body information, static resource information, webpage directory structure information, webpage association information, webpage error page information, webpage footer technical support information, and a reading guide file.
S22, extracting preset keywords from the webpage feature information corresponding to the predicted webpage and converting the preset keywords into corresponding feature vectors.
S23, determining the webpage type corresponding to the predicted webpage by using the CMS clustering model based on the obtained feature vector.
In specific implementation, the webpage feature information is obtained from the predicted webpage, and the keywords are extracted from the obtained webpage feature information. Preferably, the extracted keywords are subjected to dimensionality reduction by a dimensionality reduction algorithm. The dimensionality-reduced keywords are then converted into corresponding feature vectors, the obtained feature vectors are input into the CMS clustering model, and the webpage type corresponding to the predicted webpage is output.
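How a trained density-based model assigns a type to a new feature vector is not detailed in this description. One plausible sketch, given here purely as an assumption, is to assign the vector to the cluster of its nearest core sample if that sample lies within the neighborhood distance threshold, and otherwise to report it as an outlier. The function name, the Euclidean distance and the representation of the model as lists of core samples and labels are all hypothetical.

```python
import math

def predict_cluster(vector, core_points, core_labels, eps):
    """Return the label of the nearest core sample within eps, else -1 (outlier)."""
    best_label, best_dist = -1, eps
    for point, label in zip(core_points, core_labels):
        d = math.dist(vector, point)
        if d <= best_dist:
            best_label, best_dist = label, d
    return best_label
```

A return value of -1 marks a page that matches no known cluster; such a page is a candidate for a CMS type not seen in the samples.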
The embodiment of the invention provides a content management system identification method based on an unsupervised clustering algorithm. On one hand, unsupervised learning avoids the high cost of manually labeling categories. On the other hand, the method uses the NQ-DBSCAN algorithm, combined with the source code features of most web pages, to identify and discover webpage types. In this process, the original features of the CMS are fundamentally preserved, no fingerprint database needs to be maintained, and the webpage does not need to be visited multiple times, which improves the data processing efficiency and the identification rate. In addition, abnormal data, which is one of the possible prediction results, helps to discover new types of CMS.
Based on the same technical concept, an embodiment of the present invention further provides a content management system identification apparatus, as shown in fig. 3, which may include:
an obtaining unit 31, configured to obtain web page feature information from the predicted web page, where the web page feature information includes at least one of: webpage response header and response body information, static resource information, webpage directory structure information, webpage association information, webpage error page information, webpage footer technical support information, and a reading guide file;
the extraction and conversion unit 32 is configured to extract preset keywords from the webpage feature information corresponding to the predicted webpage and convert the preset keywords into corresponding feature vectors;
a determining unit 33, configured to determine, based on the feature vector, a webpage type corresponding to the predicted webpage by using a CMS clustering model, where the CMS clustering model is obtained by performing clustering iteration by using webpage feature information extracted from a webpage sample.
In an embodiment, the content management system identification apparatus provided in the embodiment of the present invention further includes a clustering unit, where:
the acquiring unit is further configured to acquire, for each web page sample, web page feature information from the web page sample;
the extraction and conversion unit is further used for respectively extracting preset keywords from the webpage feature information corresponding to each webpage sample and converting the preset keywords into corresponding feature vectors;
and the clustering unit is used for performing clustering iteration by using a clustering algorithm according to preset clustering parameters based on the characteristic vectors corresponding to the webpage samples to obtain a CMS (content management system) clustering model.
In one embodiment, the clustering algorithm comprises a fast density clustering algorithm NQ-DBSCAN based on a proximity search technique.
In one embodiment, the clustering parameters include: a neighborhood distance threshold of a webpage sample, and a threshold on the number of samples contained in the neighborhood within that distance of the webpage sample.
In an embodiment, the determining unit is further configured to determine, after each clustering iteration is finished, a contour coefficient corresponding to the current clustering, where the contour coefficient includes a cohesion degree and a separation degree; determine the difference between the contour coefficient corresponding to the current clustering and the contour coefficient corresponding to the last clustering; if the difference is not larger than a preset difference threshold, stop iteration; otherwise, continue clustering iteration until the difference is not greater than the preset difference threshold.
In an embodiment, an apparatus for identifying a content management system provided in an embodiment of the present invention further includes:
and the dimension reduction unit is used for performing dimension reduction processing on the extracted preset keywords by using a preset dimension reduction algorithm after the extraction and conversion unit respectively extracts the preset keywords from the webpage feature information corresponding to each webpage sample and before the preset keywords are converted into corresponding feature vectors.
For convenience of description, the above parts are separately described as modules (or units) according to functional division. Of course, the functionality of the various modules (or units) may be implemented in the same or in multiple pieces of software or hardware in practicing the invention.
Having described the content management system identification method and apparatus according to an exemplary embodiment of the present invention, a computing apparatus according to another exemplary embodiment of the present invention is next described.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."
In some possible embodiments, a computing device according to the present invention may include at least one processor, and at least one memory. Wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the content management system identification method according to various exemplary embodiments of the present invention described above in this specification. For example, the processor may perform step S21 shown in fig. 2, obtaining web page feature information from the predicted web page, and step S22, extracting preset keywords from the web page feature information corresponding to the predicted web page and converting the preset keywords into corresponding feature vectors; and step S23, determining the webpage type corresponding to the predicted webpage by utilizing the CMS clustering model based on the obtained feature vector.
The computing device 40 according to this embodiment of the invention is described below with reference to fig. 4. The computing device 40 shown in fig. 4 is only an example and should not bring any limitations to the functionality or scope of use of embodiments of the present invention.
As shown in fig. 4, the computing apparatus 40 is embodied in the form of a general purpose computing device. Components of computing device 40 may include, but are not limited to: the at least one processor 41, the at least one memory 42, and a bus 43 connecting the various system components (including the memory 42 and the processor 41).
Bus 43 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, or a local bus using any of a variety of bus architectures.
The memory 42 may include readable media in the form of volatile memory, such as Random Access Memory (RAM)421 and/or cache memory 422, and may further include Read Only Memory (ROM) 423.
Memory 42 may also include a program/utility 425 having a set (at least one) of program modules 424, such program modules 424 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Computing device 40 may also communicate with one or more external devices 44 (e.g., keyboard, pointing device, etc.), with one or more devices that enable a user to interact with computing device 40, and/or with any devices (e.g., router, modem, etc.) that enable computing device 40 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 45. Also, computing device 40 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) through network adapter 46. As shown, the network adapter 46 communicates with the other modules of the computing device 40 over the bus 43. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computing device 40, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
In some possible embodiments, various aspects of the content management system identification method provided by the present invention may also be implemented in the form of a program product, which includes program code for causing a computer device to execute the steps in the content management system identification method according to the various exemplary embodiments of the present invention described above in this specification when the program product runs on the computer device. For example, the computer device may execute step S21 shown in fig. 2, obtaining web page feature information from a predicted web page; step S22, extracting preset keywords from the web page feature information corresponding to the predicted web page and converting the preset keywords into corresponding feature vectors; and step S23, determining the webpage type corresponding to the predicted webpage by using the CMS clustering model based on the obtained feature vectors.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product for content management system identification of embodiments of the present invention may employ a portable compact disk read only memory (CD-ROM) and include program code, and may be run on a computing device. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device over any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., over the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the units described above may be embodied in one unit. Conversely, the features and functions of one unit described above may be further divided and embodied by a plurality of units.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be broken down into multiple steps for execution.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the invention has been described in connection with specific embodiments thereof, it is to be understood that the appended drawings and description are intended to indicate that the invention may be embodied in other specific forms without departing from the spirit or scope of the invention.

Claims (10)

1. A method for identifying a content management system, comprising:
acquiring webpage characteristic information from a predicted webpage, wherein the webpage characteristic information comprises at least one of the following items: webpage response header and response body information, static resource information, webpage directory structure information, webpage association information, webpage error page information, webpage footer technical support information, and a readme file;
extracting preset keywords from the webpage characteristic information corresponding to the predicted webpage and converting the preset keywords into corresponding characteristic vectors;
and determining the webpage type corresponding to the predicted webpage by utilizing a CMS clustering model based on the feature vector, wherein the CMS clustering model is obtained by utilizing webpage feature information extracted from a webpage sample to perform clustering iteration.
2. The method of claim 1, wherein the CMS clustering model is obtained by performing clustering iteration according to the following procedure using web page feature information extracted from a web page sample:
acquiring webpage characteristic information from each webpage sample;
respectively extracting preset keywords from the webpage characteristic information corresponding to each webpage sample and converting the preset keywords into corresponding characteristic vectors;
and based on the feature vectors corresponding to the webpage samples, performing clustering iteration by using a clustering algorithm according to preset clustering parameters to obtain a CMS (content management system) clustering model.
3. The method of claim 2, wherein the clustering algorithm comprises a fast density clustering algorithm NQ-DBSCAN based on a proximity search technique.
4. The method of claim 3, wherein the clustering parameters comprise: a neighborhood distance threshold of a webpage sample, and a threshold on the number of samples contained in the neighborhood within the distance threshold of the webpage sample.
5. The method of claim 2, 3 or 4, further comprising, after each iteration of clustering is complete:
determining a silhouette coefficient corresponding to the current clustering, wherein the silhouette coefficient combines the cohesion and separation of the clustering;
determining the difference between the silhouette coefficient corresponding to the current clustering and the silhouette coefficient corresponding to the previous clustering;
if the difference is not greater than a preset difference threshold, stopping the iteration; otherwise, continuing the clustering iteration until the difference is not greater than the preset difference threshold.
6. The method according to claim 2, 3 or 4, wherein after extracting the preset keywords from the web page feature information corresponding to each web page sample and before converting the preset keywords into the corresponding feature vectors, the method further comprises:
performing dimensionality reduction on the extracted preset keywords by using a preset dimensionality reduction algorithm.
7. A content management system identification apparatus, comprising:
an obtaining unit, configured to obtain web page feature information from a predicted web page, where the web page feature information includes at least one of: webpage response header and response body information, static resource information, webpage directory structure information, webpage association information, webpage error page information, webpage footer technical support information, and a readme file;
the extraction and conversion unit is used for extracting preset keywords from the webpage characteristic information corresponding to the predicted webpage and converting the preset keywords into corresponding characteristic vectors;
and the determining unit is used for determining the webpage type corresponding to the predicted webpage by utilizing a CMS clustering model based on the feature vector, wherein the CMS clustering model is obtained by utilizing webpage feature information extracted from a webpage sample to perform clustering iteration.
8. The apparatus of claim 7, further comprising a clustering unit, wherein:
the obtaining unit is further configured to obtain, for each web page sample, web page feature information from the web page sample;
the extraction and conversion unit is further used for respectively extracting preset keywords from the webpage feature information corresponding to each webpage sample and converting the preset keywords into corresponding feature vectors;
and the clustering unit is used for performing clustering iteration by using a clustering algorithm according to preset clustering parameters based on the characteristic vectors corresponding to the webpage samples to obtain a CMS (content management system) clustering model.
9. A computing device, the computing device comprising: memory, processor and computer program stored on the memory and executable on the processor, which computer program, when executed by the processor, carries out the steps of the method according to any one of claims 1 to 6.
10. A computer storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
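The stopping rule recited in claim 5, which compares the contour (silhouette) coefficient of successive clusterings, can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: Euclidean distance, noise points labeled -1 and skipped, the absolute value of the difference compared against the threshold, and hypothetical function names. How each successive clustering is produced (for example by NQ-DBSCAN with the preset parameters) is outside the sketch.

```python
import math


def silhouette_coefficient(points, labels):
    """Mean silhouette over all clustered points; label -1 (noise) is skipped."""
    clusters = {}
    for point, label in zip(points, labels):
        if label != -1:
            clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        return 0.0  # silhouette is undefined for fewer than two clusters
    scores = []
    for point, label in zip(points, labels):
        if label == -1:
            continue
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # Cohesion: mean distance to the other members of the same cluster.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # Separation: smallest mean distance to the members of another cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def iterate_until_stable(points, labelings, diff_threshold):
    """Walk successive clusterings and stop once the coefficient stabilizes.

    Returns (index of the clustering at which iteration stopped, its coefficient).
    Using the absolute difference is an assumption of this sketch.
    """
    previous = None
    for index, labels in enumerate(labelings):
        current = silhouette_coefficient(points, labels)
        if previous is not None and abs(current - previous) <= diff_threshold:
            return index, current
        previous = current
    return len(labelings) - 1, previous
```

For two well-separated pairs of points, a labeling that groups each pair together scores close to 1, while a labeling that splits the pairs scores below 0, so the loop continues past a poor clustering and stops once two consecutive clusterings score alike.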
CN201910986648.3A 2019-10-17 2019-10-17 Content management system identification method, device and storage medium Pending CN110825941A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910986648.3A CN110825941A (en) 2019-10-17 2019-10-17 Content management system identification method, device and storage medium

Publications (1)

Publication Number Publication Date
CN110825941A true CN110825941A (en) 2020-02-21

Family

ID=69549683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910986648.3A Pending CN110825941A (en) 2019-10-17 2019-10-17 Content management system identification method, device and storage medium

Country Status (1)

Country Link
CN (1) CN110825941A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010123000A (en) * 2008-11-20 2010-06-03 Nippon Telegr & Teleph Corp <Ntt> Web page group extraction method, device and program
CN103530367A (en) * 2013-10-12 2014-01-22 深圳先进技术研究院 Phishing netsite identification system and method
CN105824822A (en) * 2015-01-05 2016-08-03 任子行网络技术股份有限公司 Method clustering phishing page to locate target page
CN108549904A (en) * 2018-03-28 2018-09-18 西安理工大学 Difference secret protection K-means clustering methods based on silhouette coefficient
CN109150817A (en) * 2017-11-24 2019-01-04 新华三信息安全技术有限公司 A kind of web-page requests recognition methods and device

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695075A (en) * 2020-06-12 2020-09-22 国网浙江省电力有限公司信息通信分公司 Website CMS (content management system) identification method and security vulnerability detection method and device
CN111695075B (en) * 2020-06-12 2023-04-18 国网浙江省电力有限公司信息通信分公司 Website CMS (content management system) identification method and security vulnerability detection method and device
CN112131508A (en) * 2020-09-25 2020-12-25 深信服科技股份有限公司 Method, equipment, device and medium for identifying fingerprint of website application framework
CN112434250A (en) * 2020-12-15 2021-03-02 安徽三实信息技术服务有限公司 CMS (content management system) identification feature rule extraction method based on online website
CN113420818A (en) * 2021-06-27 2021-09-21 杭州迪普科技股份有限公司 Content management system identification method and device
CN113723100A (en) * 2021-09-09 2021-11-30 国网电子商务有限公司 Open source component identification method and device based on fingerprint characteristics
CN113723100B (en) * 2021-09-09 2023-10-13 国网数字科技控股有限公司 Open source component identification method and device based on fingerprint characteristics
CN115146712A (en) * 2022-06-15 2022-10-04 北京天融信网络安全技术有限公司 Internet of things asset identification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200221