CN113468391B - Website information clustering method and device, electronic device and computer equipment - Google Patents

Publication number
CN113468391B
Authority
CN
China
Prior art keywords
asset data
website
information
clustering
website information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110791002.7A
Other languages
Chinese (zh)
Other versions
CN113468391A (en
Inventor
宋建昌
范渊
黄进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DBAPPSecurity Co Ltd
Original Assignee
DBAPPSecurity Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DBAPPSecurity Co Ltd
Priority to CN202110791002.7A
Publication of CN113468391A
Application granted
Publication of CN113468391B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Educational Administration (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a website information clustering method and device, an electronic device, and computer equipment. The website information clustering method comprises the following steps: obtaining website information of a target website; performing multidimensional association analysis on the website information to obtain asset data associated with the target website; vectorizing the asset data according to predetermined website information indexes to obtain multi-dimensional vectors of the asset data; performing hierarchical clustering on the multi-dimensional vectors to obtain clusters of the asset data; and determining the closeness between the asset data and the target website according to the clusters of the asset data. Through hierarchical clustering, the method clusters the asset data obtained from different dimensions of the website information and thereby calculates the degree of association between the target website and the associated websites, which improves the accuracy of website association analysis.

Description

Website information clustering method and device, electronic device and computer equipment
Technical Field
The application relates to the technical field of website expansion analysis, in particular to a website information clustering method, a device, an electronic device and computer equipment.
Background
With the development of science and technology, internet crime is increasing day by day and seriously affects people's work and daily life. Since internet crimes are typically committed by organized groups, different illegal websites may share associated information. In the related art, after a target illegal website has been determined, information about other websites related to it is obtained by analyzing independent dimensions, such as domain name resolution or IP information analysis. Because the associated website information is obtained by expanding on a single dimension of the website at a time, the dimensions used for association analysis remain largely independent of one another, and the accuracy of the association analysis result is low.
No effective solution has yet been proposed for the problem of low accuracy of website association analysis in the related art.
Disclosure of Invention
The embodiment provides a website information clustering method, device, electronic device and computer equipment, so as to solve the problem of low accuracy of website associated information analysis in the related technology.
In a first aspect, this embodiment provides a website information clustering method, including:
acquiring website information of a target website;
performing multidimensional association analysis on the website information to obtain asset data associated with the target website;
vectorizing the asset data according to a predetermined website information index to obtain a multidimensional vector of the asset data;
and performing hierarchical clustering processing on the multi-dimensional vector to obtain clusters of the asset data, and determining the closeness between the asset data and the target website according to the clusters of the asset data.
In some embodiments, performing multidimensional association analysis on the website information to obtain asset data associated with the target website includes:
extracting multidimensional information of the target website from the website information;
and performing association analysis on each piece of the multi-dimensional information to obtain the asset data associated with each piece of the multi-dimensional information.
In some embodiments, the asset data includes first asset data and second asset data, and performing multidimensional association analysis on the website information to obtain asset data associated with the target website further includes:
performing multidimensional association analysis on the website information to obtain first asset data directly associated with the website information;
and performing expansion association on the first asset data to obtain second asset data directly associated with the first asset data.
In some embodiments, vectorizing the asset data according to a predetermined website information index to obtain a multidimensional vector of the asset data includes:
vectorizing the asset data according to the similarity between the asset data and the website information on the predetermined website information indexes to obtain a multidimensional vector of the asset data.
In some embodiments, performing hierarchical clustering on the multi-dimensional vector to obtain the clusters of the asset data includes:
performing hierarchical clustering on the multi-dimensional vectors to obtain the distances between the multi-dimensional vectors;
and clustering the multi-dimensional vectors according to the distances between them to obtain the clusters of the asset data.
In some embodiments, the website information of the target website includes SSL certificate information, ICP record information, web page response information, and IP information.
In a second aspect, in this embodiment, there is provided a website information clustering apparatus, including: the device comprises an acquisition module, an association module, a vectorization module and a clustering module, wherein:
the acquisition module is used for acquiring website information of a target website;
the association module is used for carrying out multidimensional association analysis on the website information to obtain asset data associated with the target website;
the vectorization module is used for vectorizing the asset data according to a predetermined website information index to obtain a multidimensional vector of the asset data;
and the clustering module is used for carrying out hierarchical clustering processing on the multi-dimensional vector to obtain the clusters of the asset data, and determining the closeness between the asset data and the target website according to the clusters of the asset data.
In a third aspect, in this embodiment, there is provided an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the website information clustering method described in the first aspect when executing the computer program.
In a fourth aspect, in this embodiment, there is provided a computer device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor, where the processor implements the website information clustering method described in the first aspect when executing the computer program.
In a fifth aspect, in this embodiment, there is provided a storage medium having stored thereon a computer program that, when executed by a processor, implements the website information clustering method described in the first aspect above.
The website information clustering method, the device, the electronic device and the computer equipment acquire website information of the target website, perform multidimensional association analysis on the website information to obtain asset data associated with the target website, perform vectorization processing on the asset data according to a predetermined website information index to obtain a multidimensional vector of the asset data, perform hierarchical clustering processing on the multidimensional vector to obtain clusters of the asset data, and determine the closeness of the asset data and the target website according to the clusters of the asset data. The method realizes clustering of the asset data acquired by different dimensionalities of the website information through hierarchical clustering, and further realizes calculation of the relevance between the target website and the associated website, thereby improving the accuracy of website association analysis.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
Fig. 1 is a hardware structure block diagram of a terminal for a website information clustering method in the related art;
Fig. 2 is a flowchart of a website information clustering method of the present embodiment;
Fig. 3 is a flowchart of another website information clustering method of the present embodiment;
Fig. 4 is a schematic structural diagram of a website information clustering apparatus of the present embodiment.
Detailed Description
The present application will be described and illustrated with reference to the accompanying drawings and examples for a clearer understanding of the objects, technical solutions and advantages of the present application.
Unless defined otherwise, technical or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," "these" and similar terms in this application are not intended to be limiting in number, but may be singular or plural. The terms "comprising," "including," "having," and any variations thereof, as used herein, are intended to cover non-exclusive inclusion; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (units) is not limited to the listed steps or modules (units), but may include other steps or modules (units) that are not listed or that are inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this disclosure are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. Typically, the character "/" indicates that the associated objects are in an "or" relationship. The terms "first," "second," "third," and the like, as referred to in this disclosure, merely distinguish similar objects and do not represent a particular ordering of the objects.
The method embodiments provided in the present embodiment may be executed in a terminal, a computer, or similar computing device. For example, the method runs on a terminal, and fig. 1 is a block diagram of the hardware structure of the terminal of the website information clustering method of the present embodiment. As shown in fig. 1, the terminal may include one or more (only one is shown in fig. 1) processors 102 and a memory 104 for storing data, wherein the processors 102 may include, but are not limited to, a microprocessor MCU, a programmable logic device FPGA, or the like. The terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and is not intended to limit the structure of the terminal. For example, the terminal may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to the website information clustering method in the present embodiment, and the processor 102 executes the computer program stored in the memory 104, thereby performing various functional applications and data processing, that is, implementing the above-described method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. The network includes a wireless network provided by a communication provider of the terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as a NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
In this embodiment, a website information clustering method is provided, and fig. 2 is a flowchart of the website information clustering method in this embodiment, as shown in fig. 2, where the flowchart includes the following steps:
Step S210, acquiring website information of a target website.
The target website may be an illegal website that has already been identified. To obtain information about other websites associated with the target website, multi-dimensional website information of the target website may be obtained first. Specifically, the website information of the target website may include the SSL (Secure Sockets Layer) certificate of the target website, ICP (Internet Content Provider) record information, web page response information, IP (Internet Protocol) information, and domain name information.
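As a minimal sketch, the website information gathered in step S210 could be held in a simple record such as the one below. The class name `WebsiteInfo`, its field names, and the sample values are hypothetical and chosen only for illustration; the source does not prescribe any data layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WebsiteInfo:
    """Multi-dimensional information collected for one website (hypothetical schema)."""
    domain: str
    ssl_serial: Optional[str] = None   # serial number of the SSL certificate, if any
    icp_number: Optional[str] = None   # ICP record number, if any
    response_body: str = ""            # raw web page response text
    ip_addresses: List[str] = field(default_factory=list)  # IPs from DNS A records

    def dimensions(self) -> List[str]:
        """Return the names of the dimensions that actually carry data."""
        present = ["domain"]
        if self.ssl_serial:
            present.append("ssl_serial")
        if self.icp_number:
            present.append("icp_number")
        if self.response_body:
            present.append("response_body")
        if self.ip_addresses:
            present.append("ip_addresses")
        return present


# Illustrative target website with only some dimensions available.
target = WebsiteInfo(
    domain="target.test",
    icp_number="ICP-0001",
    ip_addresses=["192.0.2.10"],
)
available = target.dimensions()
```

Keeping track of which dimensions are populated makes it straightforward to skip an association step when, for example, the target website has no SSL certificate.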
And step S220, carrying out multidimensional association analysis on the website information to obtain asset data associated with the target website.
Specifically, if the target website has an SSL certificate, asset data whose SSL certificate has the same serial number may be obtained from the domain name asset library based on the SSL certificate of the target website. Likewise, asset data with the same ICP record number can be obtained from the domain name asset library according to the ICP record number of the target website. Asset data whose response results are similar to those of the target website can also be obtained from the domain name asset library by means of simhash calculation. Corresponding asset data can further be obtained according to the IP address in the A record returned by DNS resolution. In addition, asset data corresponding to IP addresses in the same class C segment as the target website can be obtained through DNS resolution. While the asset data of the target website is acquired, the web page response information and IP information of that asset data are acquired as well.
Additionally, after the asset data directly associated with the target website is obtained according to the website information of the target website, asset data associated with that directly associated asset data may be further obtained. Specifically, after the asset data directly associated with the target website is obtained according to the SSL certificate, ICP record information and web page response information of the target website, DNS resolution is further performed on the directly associated asset data to obtain the corresponding asset data.
The asset data acquired from the website information of the target website is associated with the target website in different dimensions, so that a set of asset data associated with the target website in each dimension is formed. This set can be used for the subsequent extraction of multidimensional indexes and for calculating the degree of association with the target website.
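Two of the comparisons mentioned above, response similarity via simhash and membership of the same class C segment, can be sketched as follows. This is an illustrative simhash variant (MD5-based token hashing, 64-bit fingerprint) and an assumed reading of "same C segment" as the same /24 network; neither detail is specified by the source, and the function names are hypothetical.

```python
import hashlib
import ipaddress

FINGERPRINT_BITS = 64  # assumed fingerprint width


def simhash(text: str) -> int:
    """Compute a simple 64-bit simhash over whitespace-separated tokens."""
    counts = [0] * FINGERPRINT_BITS
    for token in text.split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for i in range(FINGERPRINT_BITS):
            counts[i] += 1 if (token_hash >> i) & 1 else -1
    # A bit is set in the fingerprint when more tokens voted 1 than 0.
    return sum(1 << i for i in range(FINGERPRINT_BITS) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def same_c_segment(ip_a: str, ip_b: str) -> bool:
    """True when both IPv4 addresses fall in the same /24 network."""
    network = ipaddress.ip_network(f"{ip_a}/24", strict=False)
    return ipaddress.ip_address(ip_b) in network


page = "welcome login register bonus"
fingerprint = simhash(page)
```

A small Hamming distance between two fingerprints suggests similar responses; the threshold that counts as "similar" would have to be chosen for the application.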
And step S230, carrying out vectorization processing on the asset data according to a predetermined website information index to obtain a multidimensional vector of the asset data.
The website information index may be determined according to the actual application scenario and the web page content of the target website, for example, the website information index may be a website title, a web page simhash value, an IP address, an ICP record number, contact information, SSL certificate information, and the like. After determining the website information index, the asset data associated with the target website is vectorized according to the association relationship between the asset data and the target website on each website information index determined in step S220, so as to obtain the multidimensional vector corresponding to the asset data. Specifically, the value of the asset data on each website information index can be determined according to whether the asset data is equal to the data of the target website on the corresponding indexes on the indexes such as the website title, the webpage simhash value, the IP address, the ICP record number, the contact information, the SSL certificate information and the like, so as to obtain the multidimensional vector of the asset data.
Further, the value of the asset data on each website information index is determined according to the association relationship between the asset data and the target website on that index; it may be determined from the weight preset for each website information index and from the association process by which the asset data was obtained. For example, if the asset data A is identical to the target website in the website title, the value of the asset data A on the website title index is 0. If the asset data A has the same data as the target website on the SSL certificate information index, the weight of the SSL certificate information is m, and the asset data A was obtained by association from the IP address information of asset data B, which was itself obtained by association from the web page response information of the target website, then the value of the asset data A on the SSL certificate information index can be set to m 2. In addition, different website information indexes may differ in importance within the website association relationship. For example, the closeness of two websites that share the same web page simhash value may differ from the closeness of two websites that share the same IP segment, which may be higher, so the weights set for the website information indexes may differ.
Vectorizing the asset data yields the value of each piece of asset data associated with the target website on every website information index, and the values on the different indexes form the multi-dimensional vector. The abstract association relationship between each piece of asset data and the target website is thereby quantified into specific numerical values, which provide the basic data for the subsequent calculation of the closeness between each piece of asset data and the target website.
Step S240, hierarchical clustering processing is carried out on the multidimensional vector to obtain clustering of the asset data, and the closeness of the asset data and the target website is determined according to the clustering of the asset data.
Hierarchical clustering is an unsupervised learning algorithm that calculates the similarity between nodes using a chosen similarity measure and progressively connects the nodes in order of similarity from high to low. Applying a hierarchical clustering algorithm to the multi-dimensional vectors yields the distances between the multi-dimensional vectors corresponding to the asset data, where a smaller distance between two pieces of asset data indicates a higher similarity between them. After the distances between the different multi-dimensional vectors are obtained, the two multi-dimensional vectors closest to each other can be merged, thereby generating a cluster tree. After repeated calculation and merging, the clusters of the asset data are obtained; there may be one or more such clusters.
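The merge loop described above can be sketched as a small agglomerative procedure. The sketch assumes Euclidean distance, single linkage (cluster distance is the smallest member-to-member distance), and a stopping threshold; these choices, the function names, and the sample vectors are assumptions for illustration and are not taken from the source.

```python
import math
from typing import Dict, List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerative(vectors: Dict[str, Sequence[float]], threshold: float) -> List[List[str]]:
    """Repeatedly merge the two closest clusters until the closest pair exceeds threshold."""
    clusters: List[List[str]] = [[name] for name in sorted(vectors)]

    def linkage(c1: List[str], c2: List[str]) -> float:
        # Single linkage: distance between the closest members of the two clusters.
        return min(euclidean(vectors[a], vectors[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # remaining clusters are too far apart to merge
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(clusters)


vectors = {
    "target": (1.0, 1.0, 1.0),
    "a": (1.0, 1.0, 0.9),
    "b": (0.9, 1.0, 1.0),
    "z": (0.0, 0.0, 0.1),
}
clusters = agglomerative(vectors, threshold=0.5)
```

With these sample vectors, "a" and "b" end up in the same cluster as "target", while "z" stays in a cluster of its own. Stopping at a threshold is one way to obtain one or more clusters from the cluster tree.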
In addition, the cluster to which the target website belongs can be determined, and the closeness of the other asset data in that cluster to the target website can be analyzed according to their distances from the target website. Asset data in other clusters are less close to the target website than asset data in the same cluster as the target website. Therefore, once the clusters of the asset data are obtained, the closeness of the different asset data to the target website can be obtained through analysis.
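One way to turn the clusters into a closeness ranking is sketched below: assets in the target website's cluster are ordered by their distance to the target, and assets in other clusters are placed after them. The function name `rank_closeness`, the use of Euclidean distance, and the sample data are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence


def rank_closeness(
    vectors: Dict[str, Sequence[float]],
    clusters: List[List[str]],
    target: str,
) -> List[str]:
    """Order assets from closest to least close to the target website."""
    target_cluster = next(c for c in clusters if target in c)

    def sort_key(name: str):
        in_other_cluster = name not in target_cluster  # False sorts before True
        return (in_other_cluster, math.dist(vectors[name], vectors[target]))

    return sorted((n for n in vectors if n != target), key=sort_key)


vectors = {
    "target": (1.0, 1.0),
    "a": (1.0, 0.8),
    "b": (0.9, 1.0),
    "z": (0.0, 0.0),
}
clusters = [["a", "b", "target"], ["z"]]
ranking = rank_closeness(vectors, clusters, "target")
```

Here "b" ranks ahead of "a" because it lies nearer to the target within the shared cluster, and "z" ranks last because it belongs to a different cluster.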
In steps S210 to S240 above, the website information of the target website is obtained, multidimensional association analysis is performed on the website information to obtain asset data associated with the target website, the asset data is vectorized according to a predetermined website information index to obtain a multidimensional vector of the asset data, hierarchical clustering is performed on the multidimensional vector to obtain clusters of the asset data, and the closeness between the asset data and the target website is determined according to the clusters of the asset data. Through hierarchical clustering, the method clusters the asset data obtained from different dimensions of the website information and thereby calculates the degree of association between the target website and the associated websites, which improves the accuracy of website association analysis.
Further, in one embodiment, based on the step S220, multidimensional association analysis is performed on the website information to obtain asset data associated with the target website, which specifically includes the following steps:
step S221, extracting multi-dimension information of the target website from the website information.
Specifically, the multidimensional information of the target website, such as its SSL certificate, ICP record information, web page response information, IP information, and domain name information, is obtained from the website information.
Step S222, respectively carrying out association analysis on the multi-dimensional information to obtain asset data associated with each piece of the multi-dimensional information.
Specifically, association analysis can be performed on SSL certificates, ICP record information, web page response information, IP information, domain name information, and the like of the target website, respectively, to obtain asset data associated with each dimension information in the multi-dimension information.
The multi-dimensional information of the target website is extracted from the website information, and the multi-dimensional information is respectively subjected to association analysis to obtain the asset data associated with each piece of the multi-dimensional information, so that the asset data associated with the multi-dimensional information of the target website is obtained, the more comprehensive associated asset data of the target website can be obtained, and the efficiency of searching the associated website based on the target website is improved.
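Steps S221 and S222 can be sketched as a per-dimension lookup against an asset library. The in-memory list `ASSET_LIBRARY`, the field names, and the exact-match rule are hypothetical stand-ins; a real domain name asset library and its query interface are not described in the source.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for the domain name asset library.
ASSET_LIBRARY: List[Dict[str, str]] = [
    {"domain": "a.test", "ssl_serial": "0A1B", "icp_number": "ICP-1"},
    {"domain": "b.test", "ssl_serial": "FFFF", "icp_number": "ICP-1"},
    {"domain": "c.test", "ssl_serial": "0A1B", "icp_number": "ICP-9"},
    {"domain": "d.test", "ssl_serial": "EEEE", "icp_number": "ICP-7"},
]


def associate_by_dimension(target: Dict[str, str], dimensions: List[str]) -> Dict[str, List[str]]:
    """For each dimension, list the domains whose value equals the target's value."""
    result: Dict[str, List[str]] = {}
    for dim in dimensions:
        value = target.get(dim)
        if value is None:
            result[dim] = []  # the target has no data in this dimension
            continue
        result[dim] = [
            asset["domain"]
            for asset in ASSET_LIBRARY
            if asset.get(dim) == value and asset["domain"] != target["domain"]
        ]
    return result


target = {"domain": "target.test", "ssl_serial": "0A1B", "icp_number": "ICP-1"}
associated = associate_by_dimension(target, ["ssl_serial", "icp_number"])
```

In this example "a.test" is found through both dimensions, which is the kind of overlap that the later vectorization step can capture.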
Additionally, in one embodiment, based on the step S220, the asset data includes first asset data and second asset data, and the multidimensional association analysis is performed on the website information to obtain asset data associated with the target website, and specifically further includes the following steps:
Step S223, multidimensional association analysis is carried out on the website information to obtain first asset data directly associated with the website information.
The first asset data is asset data directly associated with a target website, wherein the first asset data is obtained by directly carrying out multidimensional association analysis on website information of the target website.
Step S224, performing expansion association on the first asset data to obtain second asset data directly associated with the first asset data.
The second asset data is obtained by performing an operation such as DNS resolution on the first asset data after the first asset data is obtained, and thus the second asset data is directly associated with the first asset data and indirectly associated with the target website.
By acquiring the first asset data directly associated with the target website and the second asset data directly associated with the first asset data, the efficiency of finding the associated website based on the target website can also be improved.
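The two-level expansion of steps S223 and S224 can be sketched as follows, recording for each asset whether it was reached directly (first asset data) or through a first-level asset (second asset data). The `direct` callback, the `LINKS` table, and the hop-count representation are hypothetical and stand in for whatever association analysis is actually used.

```python
from typing import Callable, Dict, Iterable


def expand_assets(target: str, direct: Callable[[str], Iterable[str]]) -> Dict[str, int]:
    """Map each associated asset to its hop count from the target (1 or 2)."""
    hops: Dict[str, int] = {}
    first_level = set(direct(target)) - {target}
    for asset in first_level:
        hops[asset] = 1  # first asset data: directly associated with the target
    for asset in sorted(first_level):
        for neighbour in direct(asset):
            if neighbour != target and neighbour not in hops:
                hops[neighbour] = 2  # second asset data: reached via a first-level asset
    return hops


# Hypothetical association results, e.g. from certificate, ICP or DNS lookups.
LINKS = {
    "target.test": ["a.test", "b.test"],
    "a.test": ["c.test", "target.test"],
    "b.test": ["a.test"],
}


def lookup(name: str) -> Iterable[str]:
    return LINKS.get(name, [])


hops = expand_assets("target.test", lookup)
```

Recording the hop count alongside each asset keeps the information about the association process available for later steps.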
Additionally, in one embodiment, based on the step S230, the vectorizing process is performed on the asset data according to the predetermined website information index to obtain the multidimensional vector of the asset data, which specifically includes the following steps:
Step S231, vectorizing the asset data according to the similarity between the asset data and the website information on the predetermined website information indexes, so as to obtain a multidimensional vector of the asset data.
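A minimal sketch of similarity-based vectorization is given below. The index list, the per-index similarity rules (bit agreement for simhash, same first three octets for IP, exact match otherwise), and the convention that a higher value means more similar are all assumptions for illustration; the source leaves the concrete scoring to the application scenario.

```python
from typing import Dict, List

# Assumed set of website information indexes.
INDEXES = ["title", "simhash", "ip", "icp_number", "contact", "ssl_serial"]


def index_similarity(index: str, asset_value, target_value) -> float:
    """Similarity in [0, 1] between an asset and the target on one index."""
    if asset_value is None or target_value is None:
        return 0.0
    if index == "simhash":
        differing_bits = bin(asset_value ^ target_value).count("1")
        return 1.0 - differing_bits / 64  # assumes 64-bit fingerprints
    if index == "ip":
        # Same first three octets, i.e. the same /24 segment.
        same_segment = asset_value.split(".")[:3] == target_value.split(".")[:3]
        return 1.0 if same_segment else 0.0
    return 1.0 if asset_value == target_value else 0.0


def vectorize(asset: Dict, target: Dict) -> List[float]:
    """Build the multi-dimensional vector of an asset relative to the target."""
    return [index_similarity(i, asset.get(i), target.get(i)) for i in INDEXES]


target = {"title": "Shop", "simhash": 0b1111, "ip": "192.0.2.10",
          "icp_number": "ICP-1", "contact": None, "ssl_serial": "0A1B"}
asset = {"title": "Shop", "simhash": 0b1011, "ip": "192.0.2.77",
         "icp_number": "ICP-2", "ssl_serial": "0A1B"}
vector = vectorize(asset, target)
```

Per-index weights, or an adjustment for how many association hops separate the asset from the target, could be applied by scaling each component before clustering.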
Additionally, in one embodiment, based on the step S240, hierarchical clustering is performed on the multidimensional vector to obtain clusters of asset data, which specifically includes the following steps:
Step S241, hierarchical clustering processing is performed on the multi-dimensional vectors to obtain distances among the multi-dimensional vectors.
For example, the multidimensional vectors corresponding to 4 pieces of asset data are X1 (a1, b1, c1, d1, e1, f1, g1, h1), X2 (a2, b2, c2, d2, e2, f2, g2, h2), X3 (a3, b3, c3, d3, e3, f3, g3, h3), and X4 (a4, b4, c4, d4, e4, f4, g4, h4), respectively. For each of the 4 pieces of asset data, the distances between its multidimensional vector and all the other multidimensional vectors are calculated through a hierarchical clustering algorithm to determine the similarity between the asset data. Specifically, taking the calculation of the distance between the multidimensional vectors X1 and X2 as an example, the distance D (X1, X2) between X1 and X2 can be obtained from formula (1):
where the denominator in formula (1) may be equal to the number of dimensions of the multi-dimensional vector.
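As an illustration only, one distance that is consistent with a denominator equal to the number of dimensions is the root of the mean squared difference, sketched below. This specific form is an assumption and should not be read as the patent's formula (1); the function name and the sample vectors are likewise hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Tuple


def normalized_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Square root of the squared differences summed and divided by the dimension count."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    n = len(u)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)) / n)


# Four illustrative 8-dimensional vectors standing in for X1..X4.
vectors: Dict[str, Tuple[int, ...]] = {
    "X1": (1, 1, 1, 1, 1, 1, 1, 1),
    "X2": (1, 1, 1, 1, 1, 1, 1, 0),
    "X3": (0, 0, 0, 0, 0, 0, 0, 0),
    "X4": (1, 1, 1, 1, 0, 0, 0, 0),
}

# All six pairwise distances D(Xi, Xj) with i < j.
distances = {
    (p, q): normalized_distance(vectors[p], vectors[q])
    for p, q in combinations(sorted(vectors), 2)
}
closest_pair = min(distances, key=distances.get)
```

With these sample values the closest pair is (X1, X2), which would be the first pair merged in the clustering step.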
Step S242, clustering the multi-dimensional vectors according to the distances between them to obtain the clusters of the asset data.
After the distances D (X1, X2), D (X1, X3), D (X1, X4), D (X2, X3), D (X2, X4) and D (X3, X4) between the multidimensional vectors are obtained, the calculation starts from the vector combination with the smallest distance: the distances from the other multidimensional vectors to that combination are calculated, a new vector combination is formed according to the minimum of those distances, and the clusters of the asset data are obtained through repeated calculation. Taking D (X1, X2) as the minimum distance, the distance from the multidimensional vector X3 to the vector combination (X1, X2) is given by formula (2):
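The distance from a single vector to an already merged combination depends on the linkage rule. The sketch below shows three common rules (single, complete, and average linkage) for the distance from X3 to the combination (X1, X2). Which rule the patent's formula (2) uses is not recoverable from this text, so all three are shown as assumptions, with hypothetical names and sample vectors.

```python
import math
from typing import Dict, List, Sequence


def point_to_cluster(
    point: str,
    cluster: List[str],
    vectors: Dict[str, Sequence[float]],
    method: str = "average",
) -> float:
    """Distance from one vector to a merged combination under a chosen linkage rule."""
    member_distances = [math.dist(vectors[point], vectors[m]) for m in cluster]
    if method == "single":
        return min(member_distances)      # nearest member
    if method == "complete":
        return max(member_distances)      # farthest member
    if method == "average":
        return sum(member_distances) / len(member_distances)  # mean over members
    raise ValueError(f"unknown linkage method: {method}")


vectors = {"X1": (0.0, 0.0), "X2": (0.0, 1.0), "X3": (0.0, 4.0)}
combo = ["X1", "X2"]
d_single = point_to_cluster("X3", combo, vectors, "single")
d_complete = point_to_cluster("X3", combo, vectors, "complete")
d_average = point_to_cluster("X3", combo, vectors, "average")
```

Single linkage tends to produce elongated clusters, while complete and average linkage favour more compact ones, so the choice affects which assets end up grouped with the target website.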
further, in one embodiment, the website information of the target website includes SSL certificate information, IPC record information, web page response information, and IP information.
Through steps S210 to S242, the multi-dimensional information of the target website is extracted from the website information, and association analysis is performed on each dimension to obtain the asset data associated with it. Multidimensional association analysis of the website information yields first asset data directly associated with the website information, and expansion association of the first asset data yields second asset data directly associated with the first asset data. This produces more comprehensive associated asset data for the target website and improves the efficiency of searching for associated websites based on the target website. The asset data is vectorized according to its similarity to the website information on each predetermined website information index to obtain the multi-dimensional vectors of the asset data, which quantifies the abstract association between each asset data and the target website as a specific numerical value. Hierarchical clustering processing is performed on the multi-dimensional vectors to obtain the distances among them, and the multi-dimensional vectors are clustered according to those distances to obtain the clusters of the asset data. The closeness between the asset data and the target website is thus calculated on the basis of hierarchical clustering, which improves the accuracy of website association analysis.
The present embodiment is described and illustrated below by way of preferred embodiments.
Fig. 3 is a flowchart of the website information clustering method of the present preferred embodiment. As shown in fig. 3, the method comprises the following steps:
Step S310, collecting data of a target website;
Step S321, obtaining the corresponding asset data, response body information and IP address information according to the SSL certificate in the data of the target website;
Step S322, obtaining the corresponding asset data, response body information and IP address information according to the ICP record number in the data of the target website;
Step S323, performing a simhash calculation on the response result of the target website to obtain the corresponding asset data, response body information and IP address information;
Step S324, processing the data of the target website by DNS resolution, and obtaining the corresponding asset data, response body information and IP address information according to the IP address in the A record returned by the DNS resolution;
Step S325, obtaining the asset data, response body information and IP address information corresponding to the IP addresses on the same C segment (the /24 address range) as the IP address returned by the DNS resolution;
Step S326, performing the expansion processing of steps S324 and S325 on the asset data obtained in steps S321, S322 and S323 to further obtain other asset data;
Step S330, vectorizing the asset data based on the multidimensional indexes;
Step S340, analyzing the vectors corresponding to the asset data by using hierarchical clustering to obtain one or more clusters;
Step S350, obtaining the closeness between each asset data and the target website based on each cluster.
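Two of the steps above lend themselves to a short illustration: the simhash calculation of step S323 and the C-segment expansion of step S325. The sketch below rests on assumptions that are not stated in the text: tokens are runs of word characters, SHA-256 is used as the per-token hash, fingerprints are 64 bits wide, and "C segment" is read as the /24 network containing the address. All function names are hypothetical.

```python
import hashlib
import ipaddress
import re


def simhash(text: str, bits: int = 64) -> int:
    """Simhash fingerprint over lower-cased word tokens (bits <= 256)."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def c_segment(ip: str):
    """The /24 network that contains the given IPv4 address."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)


def c_segment_hosts(ip: str) -> list[str]:
    """Usable host addresses in the same /24 as the given address."""
    return [str(host) for host in c_segment(ip).hosts()]
```

Response bodies whose fingerprints have a small Hamming distance can be treated as similar pages, and the host list gives the candidate addresses to probe for further asset data.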
It should be noted that the steps illustrated in the above-described flow or flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order other than that illustrated herein. For example, the execution order of steps S321, S322, and S323 may be interchanged.
In this embodiment, a website information clustering device is further provided. The device is used to implement the foregoing embodiments and preferred embodiments, and what has already been described is not repeated. The terms "module," "unit," "sub-unit," and the like as used below may refer to a combination of software and/or hardware that performs a predetermined function. While the devices described in the following embodiments are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
Fig. 4 is a block diagram of the configuration of the website information clustering apparatus 40 of the present embodiment, and as shown in fig. 4, the website information clustering apparatus 40 includes: an acquisition module 42, an association module 44, a vectorization module 46, and a clustering module 48, wherein:
an acquisition module 42, configured to acquire website information of a target website;
The association module 44 is configured to perform multidimensional association analysis on the website information to obtain asset data associated with the target website;
vectorization module 46, configured to vectorize the asset data according to a predetermined website information index, to obtain a multidimensional vector of the asset data;
The clustering module 48 is configured to perform hierarchical clustering processing on the multidimensional vector to obtain clusters of asset data, and determine the closeness between the asset data and the target website according to the clusters of the asset data.
The website information clustering device 40 acquires website information of a target website, performs multidimensional association analysis on the website information to obtain asset data associated with the target website, performs vectorization processing on the asset data according to a predetermined website information index to obtain a multidimensional vector of the asset data, performs hierarchical clustering processing on the multidimensional vector to obtain clusters of the asset data, and determines the closeness between the asset data and the target website according to the clusters of the asset data. In this way, hierarchical clustering is used to cluster the asset data acquired from the different dimensions of the website information and to calculate the degree of association between the target website and the associated websites, thereby improving the accuracy of website association analysis.
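The division of the device into four modules can be sketched as a small pipeline in which each module is supplied as a callable. The class name, the callable signatures and the use of plain dictionaries are hypothetical and serve only to show how the modules hand data to one another; the comments map each field to the reference numerals used above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Vector = List[float]


@dataclass
class ClusteringDevice:
    """Wires the four modules together; each stage is injected as a callable."""
    acquire: Callable[[str], dict]                     # acquisition module 42
    associate: Callable[[dict], Sequence[str]]         # association module 44
    vectorize: Callable[[str, dict], Vector]           # vectorization module 46
    cluster: Callable[[Dict[str, Vector]], List[set]]  # clustering module 48

    def run(self, target_url: str) -> List[set]:
        info = self.acquire(target_url)      # website information
        assets = self.associate(info)        # associated asset data
        vectors = {asset: self.vectorize(asset, info) for asset in assets}
        return self.cluster(vectors)         # clusters of asset data
```

Because the stages are injected, a software or hardware implementation of any single module can be replaced without changing the others.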
In one embodiment, the association module 44 is further configured to extract the multi-dimensional information of the target website from the website information, and perform association analysis on the multi-dimensional information to obtain asset data associated with each of the multi-dimensional information.
In one embodiment, the association module 44 is further configured to perform multidimensional association analysis on the website information to obtain first asset data directly associated with the website information, and perform extended association on the first asset data to obtain second asset data directly associated with the first asset data.
In one embodiment, the vectorizing module 46 is further configured to vectorize the asset data according to the similarity between the asset data and the website information on the predetermined website information indexes, so as to obtain a multidimensional vector of the asset data.
In one embodiment, the clustering module 48 is further configured to perform hierarchical clustering on the multi-dimensional vectors to obtain distances between the multi-dimensional vectors, and cluster the multi-dimensional vectors according to the distances between the multi-dimensional vectors to obtain clusters of the asset data.
In one embodiment, the website information of the target website includes SSL certificate information, ICP record information, web page response information, and IP information.
The above-described respective modules may be functional modules or program modules, and may be implemented by software or hardware. For modules implemented in hardware, the various modules described above may be located in the same processor; or the above modules may be located in different processors in any combination.
There is also provided in this embodiment an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
Acquiring website information of a target website;
Carrying out multidimensional association analysis on the website information to obtain asset data associated with the target website;
Vectorizing the asset data according to a predetermined website information index to obtain a multidimensional vector of the asset data;
And carrying out hierarchical clustering processing on the multidimensional vector to obtain clusters of the asset data, and determining the closeness between the asset data and the target website according to the clusters of the asset data.
It should be noted that, specific examples in this embodiment may refer to examples described in the foregoing embodiments and alternative implementations, and are not described in detail in this embodiment.
In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a website information clustering method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen. The input device of the computer device may be a touch layer covering the display screen; keys, a trackball or a touch pad arranged on the housing of the computer device; or an external keyboard, touch pad or mouse.
It will be appreciated by persons skilled in the art that the structures described above are block diagrams of only some of the structures associated with the present application and are not intended to limit the computer apparatus to which the present application is applied, and that a particular computer apparatus may include more or fewer components than those described above, or may combine certain components, or have different arrangements of components.
In addition, in combination with the website information clustering method provided in the above embodiment, a storage medium may be further provided to implement this embodiment. The storage medium has a computer program stored thereon; the computer program, when executed by a processor, implements any one of the website information clustering methods of the above embodiments.
It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to be limiting. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure in accordance with the embodiments provided herein.
It is to be understood that the drawings are merely illustrative of some embodiments of the present application and that it is possible for those skilled in the art to adapt the present application to other similar situations without the need for inventive work. In addition, it should be appreciated that while the development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure, and thus should not be construed as a departure from the disclosure.
The term "embodiment" in this disclosure means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive. It will be clear or implicitly understood by those of ordinary skill in the art that the embodiments described in the present application can be combined with other embodiments without conflict.
The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the patent claims. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (8)

1. A method for clustering website information, comprising:
Acquiring website information of a target website;
performing multidimensional association analysis on the website information to obtain asset data associated with the target website;
vectorizing the asset data according to a predetermined website information index to obtain a multidimensional vector of the asset data;
performing hierarchical clustering processing on the multi-dimensional vector to obtain clusters of the asset data, and determining the closeness between the asset data and the target website according to the clusters of the asset data; wherein:
The vectorizing the asset data according to the predetermined website information index to obtain a multidimensional vector of the asset data, including:
Vectorizing the asset data according to the similarity of the asset data with the website information on each predetermined website information index to obtain a multidimensional vector of the asset data;
performing hierarchical clustering processing on the multi-dimensional vector to obtain clustering of the asset data, wherein the clustering comprises the following steps:
Hierarchical clustering is carried out on the multi-dimensional vectors to obtain the distance between the multi-dimensional vectors;
And clustering the multi-dimensional vectors according to the distance between the multi-dimensional vectors to obtain the clustering of the asset data.
2. The method for clustering website information according to claim 1, wherein the performing multidimensional association analysis on the website information to obtain asset data associated with the target website comprises:
Extracting multidimensional information of the target website from the website information;
And respectively carrying out association analysis on the multi-dimensional information to obtain asset data associated with each piece of the multi-dimensional information.
3. The method for clustering website information according to claim 1, wherein the asset data includes first asset data and second asset data, the performing multidimensional association analysis on the website information to obtain asset data associated with the target website, further comprising:
Multidimensional association analysis is carried out on the website information to obtain first asset data directly associated with the website information;
And performing expansion association on the first asset data to obtain second asset data directly associated with the first asset data.
4. The website information clustering method according to any one of claims 1 to 3, wherein the website information of the target website includes SSL certificate information, ICP record information, web page response information, and IP information.
5. A website information clustering device, comprising: the device comprises an acquisition module, an association module, a vectorization module and a clustering module, wherein:
the acquisition module is used for acquiring website information of a target website;
the association module is used for carrying out multidimensional association analysis on the website information to obtain asset data associated with the target website;
The vectorization module is used for vectorizing the asset data according to a predetermined website information index to obtain a multidimensional vector of the asset data;
the clustering module is used for carrying out hierarchical clustering processing on the multi-dimensional vectors to obtain clusters of the asset data, and determining the closeness of the asset data and the target website according to the clusters of the asset data;
the vectorization module is further used for vectorizing the asset data according to the similarity between the asset data and the website information on the predetermined website information indexes to obtain a multidimensional vector of the asset data;
the clustering module is further used for carrying out hierarchical clustering processing on the multi-dimensional vectors to obtain distances among the multi-dimensional vectors, and clustering the multi-dimensional vectors according to the distances among the multi-dimensional vectors to obtain the clustering of the asset data.
6. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the website information clustering method of any one of claims 1 to 4.
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the website information clustering method of any one of claims 1 to 4 when the computer program is executed.
8. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of the website information clustering method of any one of claims 1 to 4.
CN202110791002.7A 2021-07-13 2021-07-13 Website information clustering method and device, electronic device and computer equipment Active CN113468391B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110791002.7A CN113468391B (en) 2021-07-13 2021-07-13 Website information clustering method and device, electronic device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110791002.7A CN113468391B (en) 2021-07-13 2021-07-13 Website information clustering method and device, electronic device and computer equipment

Publications (2)

Publication Number Publication Date
CN113468391A CN113468391A (en) 2021-10-01
CN113468391B true CN113468391B (en) 2024-05-28

Family

ID=77880044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110791002.7A Active CN113468391B (en) 2021-07-13 2021-07-13 Website information clustering method and device, electronic device and computer equipment

Country Status (1)

Country Link
CN (1) CN113468391B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114615262B (en) * 2022-01-30 2024-05-14 阿里巴巴(中国)有限公司 Network aggregation method, storage medium, processor and system
CN114819971B (en) * 2022-04-22 2023-04-07 支付宝(杭州)信息技术有限公司 Wind control method based on multi-dimensional relational data, graph clustering method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444961A (en) * 2020-03-26 2020-07-24 国家计算机网络与信息安全管理中心黑龙江分中心 Method for judging internet website affiliation through clustering algorithm
CN111897962A (en) * 2020-07-27 2020-11-06 绿盟科技集团股份有限公司 Internet of things asset marking method and device
CN112003857A (en) * 2020-08-20 2020-11-27 深信服科技股份有限公司 Network asset collecting method, device, equipment and storage medium
CN112104656A (en) * 2020-09-16 2020-12-18 杭州安恒信息安全技术有限公司 Network threat data acquisition method, device, equipment and medium
CN112579848A (en) * 2020-12-10 2021-03-30 北京知道创宇信息技术股份有限公司 Website classification method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113468391A (en) 2021-10-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant