CN113468391A - Website information clustering method and device, electronic device and computer equipment - Google Patents

Website information clustering method and device, electronic device and computer equipment

Info

Publication number
CN113468391A
Authority
CN
China
Prior art keywords
asset data
website
information
website information
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110791002.7A
Other languages
Chinese (zh)
Other versions
CN113468391B (en)
Inventor
宋建昌
范渊
黄进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DBAPPSecurity Co Ltd
Original Assignee
DBAPPSecurity Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by DBAPPSecurity Co Ltd
Priority to CN202110791002.7A
Publication of CN113468391A
Application granted
Publication of CN113468391B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Educational Administration (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a website information clustering method, a website information clustering device, an electronic device and computer equipment. The website information clustering method comprises: obtaining website information of a target website; performing multi-dimensional association analysis on the website information to obtain asset data associated with the target website; vectorizing the asset data according to predetermined website information indexes to obtain a multi-dimensional vector of the asset data; performing hierarchical clustering on the multi-dimensional vectors to obtain clusters of the asset data; and determining the degree of intimacy between the asset data and the target website according to the clusters of the asset data. By hierarchically clustering the asset data obtained from different dimensions of the website information, the method makes it possible to calculate the degree of intimacy between the target website and its associated websites, thereby improving the accuracy of website association analysis.

Description

Website information clustering method and device, electronic device and computer equipment
Technical Field
The present application relates to the field of website expansion analysis technologies, and in particular, to a method and an apparatus for clustering website information, an electronic apparatus, and a computer device.
Background
With the development of science and technology, internet crime is also increasing and seriously affects people's work and daily life. Because internet crime is typically carried out by groups, different illegal websites may share associated information. At present, once a target illegal website has been identified, information on other websites related to it is obtained by analyzing a single dimension in isolation, such as domain name analysis or IP information analysis. Because related website information is obtained by expanding from one dimension of the website at a time, the dimensions used for association analysis are largely independent of one another, so the accuracy of the association analysis result is low.
No effective solution has yet been proposed for the problem of low accuracy in website association information analysis in the related art.
Disclosure of Invention
The embodiment provides a website information clustering method, a website information clustering device, an electronic device and computer equipment, so as to solve the problem that the accuracy of website associated information analysis is low in the related art.
In a first aspect, in this embodiment, a website information clustering method is provided, including:
acquiring website information of a target website;
performing multidimensional association analysis on the website information to obtain asset data associated with the target website;
vectorizing the asset data according to a predetermined website information index to obtain a multi-dimensional vector of the asset data;
and performing hierarchical clustering processing on the multi-dimensional vectors to obtain clusters of the asset data, and determining the intimacy degree of the asset data and the target website according to the clusters of the asset data.
In some embodiments, the performing multidimensional association analysis on the website information to obtain asset data associated with the target website includes:
extracting multi-dimensional information of the target website from the website information;
and respectively carrying out association analysis on the multi-dimensional information to obtain asset data associated with each dimension information in the multi-dimensional information.
In some embodiments, the asset data includes first asset data and second asset data, and performing multidimensional association analysis on the website information to obtain asset data associated with the target website further includes:
performing multidimensional association analysis on the website information to obtain first asset data directly associated with the website information;
and carrying out extended association on the first asset data to obtain second asset data directly associated with the first asset data.
In some embodiments, the vectorizing the asset data according to a predetermined website information index to obtain a multidimensional vector of the asset data includes:
and vectorizing the asset data according to the similarity between the asset data and the website information on the predetermined website information indexes to obtain a multi-dimensional vector of the asset data.
In some embodiments, the performing hierarchical clustering on the multidimensional vector to obtain a cluster of the asset data includes:
performing hierarchical clustering processing on the multi-dimensional vectors to obtain the distance between the multi-dimensional vectors;
and clustering the multi-dimensional vectors according to the distance between the multi-dimensional vectors to obtain the cluster of the asset data.
In some embodiments, the website information of the target website includes SSL certificate information, ICP filing information, web page response information, and IP information.
In a second aspect, in this embodiment, there is provided a website information clustering apparatus, including an acquisition module, an association module, a vectorization module and a clustering module, wherein:
the acquisition module is used for acquiring website information of a target website;
the association module is used for carrying out multidimensional association analysis on the website information to obtain asset data associated with the target website;
the vectorization module is used for vectorizing the asset data according to a predetermined website information index to obtain a multi-dimensional vector of the asset data;
and the clustering module is used for carrying out hierarchical clustering processing on the multi-dimensional vectors to obtain clusters of the asset data, and determining the intimacy degree between the asset data and the target website according to the clusters of the asset data.
In a third aspect, in this embodiment, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the website information clustering method according to the first aspect is implemented.
In a fourth aspect, in this embodiment, there is provided a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the website information clustering method according to the first aspect when executing the computer program.
In a fifth aspect, in the present embodiment, there is provided a storage medium, on which a computer program is stored, which when executed by a processor, implements the website information clustering method described in the first aspect above.
The website information clustering method, the website information clustering device, the electronic device and the computer equipment acquire website information of a target website, perform multidimensional association analysis on the website information to obtain asset data associated with the target website, perform vectorization processing on the asset data according to a predetermined website information index to obtain a multidimensional vector of the asset data, perform hierarchical clustering processing on the multidimensional vector to obtain a cluster of the asset data, and determine the intimacy degree between the asset data and the target website according to the cluster of the asset data. The method and the device realize the clustering of the asset data acquired by different dimensionalities of the website information through hierarchical clustering, further realize the calculation of the degree of closeness between a target website and an associated website, and further improve the accuracy of website association analysis.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a block diagram of the hardware structure of a terminal for the website information clustering method according to this embodiment;
FIG. 2 is a flowchart illustrating a website information clustering method according to this embodiment;
FIG. 3 is a flowchart illustrating another website information clustering method according to this embodiment;
FIG. 4 is a schematic structural diagram of the website information clustering device according to this embodiment.
Detailed Description
For a clearer understanding of the objects, aspects and advantages of the present application, reference is made to the following description and accompanying drawings.
Unless defined otherwise, technical or scientific terms used herein shall have the same general meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The use of the terms "a", "an", "the" and similar referents in this application does not denote a limitation of quantity and covers both the singular and the plural. The terms "comprises," "comprising," "has," "having," and any variations thereof, as used in this application, are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or modules, but may include other steps or modules (elements) not listed or inherent to such process, method, article, or apparatus. References in this application to "connected," "coupled," and the like are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "A plurality" in this application means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In general, the character "/" indicates an "or" relationship between the objects before and after it. The terms "first," "second," "third," and the like in this application are used to distinguish between similar items and do not necessarily describe a particular sequential or chronological order.
The method embodiments provided in the present embodiment may be executed in a terminal, a computer, or a similar computing device. Taking execution on a terminal as an example, fig. 1 is a block diagram of the hardware structure of the terminal for the website information clustering method according to the embodiment. As shown in fig. 1, the terminal may include one or more processors 102 (only one is shown in fig. 1) and a memory 104 for storing data, wherein the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA). The terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those of ordinary skill in the art that the structure shown in fig. 1 is merely illustrative and does not limit the structure of the terminal described above. For example, the terminal may include more or fewer components than shown in fig. 1, or have a different configuration from that shown in fig. 1.
The memory 104 may be used to store a computer program, for example, a software program and a module of application software, such as a computer program corresponding to the website information clustering method in the present embodiment, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. The network described above includes a wireless network provided by a communication provider of the terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can be connected to other network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet wirelessly.
Fig. 2 is a flowchart of the website information clustering method in this embodiment, and as shown in fig. 2, the flowchart includes the following steps:
Step S210, acquiring website information of the target website.
The target website may be an illegal website that has already been identified. To obtain information on other websites associated with the target website, multi-dimensional website information of the target website may be obtained first. Specifically, the website information of the target website may include the SSL (Secure Sockets Layer) certificate, ICP (Internet Content Provider) filing information, web page response information, IP (Internet Protocol) information, and domain name information of the target website.
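As an illustration only, the multi-dimensional website information listed above could be held in a small record type. The class name, the field names and the example values below are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class WebsiteInfo:
    """Multi-dimensional information about one website (field names are hypothetical)."""
    domain: str
    ssl_serial: Optional[str] = None        # SSL certificate serial number, if any
    icp_number: Optional[str] = None        # ICP filing number, if any
    response_simhash: Optional[int] = None  # fingerprint of the web page response
    ip_addresses: Tuple[str, ...] = ()      # IP addresses the domain resolves to

    def dimensions(self) -> dict:
        """Return only the dimensions that are populated for this website."""
        raw = {
            "domain": self.domain,
            "ssl_serial": self.ssl_serial,
            "icp_number": self.icp_number,
            "response_simhash": self.response_simhash,
            "ip_addresses": self.ip_addresses,
        }
        return {name: value for name, value in raw.items() if value not in (None, ())}


target = WebsiteInfo(domain="target.test", ssl_serial="0A1B", ip_addresses=("192.0.2.10",))
print(sorted(target.dimensions()))
```

Keeping unpopulated dimensions out of the result lets later steps skip a dimension the target website does not have, for example a site without an SSL certificate.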
Step S220, carrying out multidimensional association analysis on the website information to obtain asset data associated with the target website.
Specifically, if the target website has an SSL certificate, asset data whose SSL certificate serial number is the same as that of the target website may be obtained from the domain name asset library. Asset data with the same ICP filing number may likewise be obtained from the domain name asset library according to the ICP filing number of the target website. Asset data whose response result is similar to that of the target website may be obtained from the domain name asset library by simhash calculation. Corresponding asset data may also be obtained according to the IP address in the A record returned by DNS resolution. In addition, asset data corresponding to IP addresses in the same C segment as the IP address of the target website may be obtained by DNS resolution. When asset data of the target website is acquired, the web page response information and IP information of that asset data may also be acquired.
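The application does not say how the simhash of a response is computed. The following is a minimal sketch of one common simhash construction, assuming whitespace tokenization, a 64-bit fingerprint derived from SHA-256, and a Hamming-distance threshold of 3 for "similar"; all three choices are assumptions made for illustration.

```python
import hashlib

BITS = 64  # fingerprint width (assumption)


def simhash(text: str) -> int:
    """Compute a 64-bit simhash fingerprint over whitespace-separated tokens."""
    counts = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the token hash
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set when more tokens voted 1 than 0 at that position.
    return sum(1 << i for i, c in enumerate(counts) if c > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_similar(a: int, b: int, max_distance: int = 3) -> bool:
    """Treat two responses as similar when their fingerprints are close (assumed threshold)."""
    return hamming(a, b) <= max_distance


page = "welcome to the example portal please log in"
print(is_similar(simhash(page), simhash(page)))
```

Near-duplicate pages tend to produce fingerprints that differ in only a few bits, which is why a small Hamming distance is used as the similarity test in this sketch.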
Additionally, after the asset data directly associated with the target website has been obtained from the website information of the target website, further asset data associated with that directly associated asset data may be obtained. Specifically, after the directly associated asset data has been obtained according to the SSL certificate, the ICP filing information, and the web page response information of the target website, DNS resolution may be performed on it to obtain further corresponding asset data.
The asset data acquired from the website information of the target website is associated with the target website in different dimensions. Together it forms a set of asset data associated with the target website in each dimension, which can be used for the subsequent extraction of multi-dimensional indexes and the calculation of the degree of intimacy with the target website.
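The per-dimension lookups described for step S220 can be sketched as follows. The domain name asset library is modelled here as a hypothetical in-memory list of dictionaries; the key names, the Hamming threshold of 3, and the reading of "same C segment" as "same IPv4 /24 network" are all assumptions.

```python
import ipaddress


def same_c_segment(ip_a: str, ip_b: str) -> bool:
    """True when two IPv4 addresses fall in the same /24 network (assumed meaning of C segment)."""
    network = ipaddress.ip_network(f"{ip_a}/24", strict=False)
    return ipaddress.ip_address(ip_b) in network


def associate(target: dict, library: list) -> dict:
    """Map each associated asset's domain to the set of dimensions on which it matched."""
    hits = {}
    for asset in library:
        matched = set()
        if target.get("ssl_serial") and asset.get("ssl_serial") == target["ssl_serial"]:
            matched.add("ssl")
        if target.get("icp_number") and asset.get("icp_number") == target["icp_number"]:
            matched.add("icp")
        if (target.get("simhash") is not None and asset.get("simhash") is not None
                and bin(target["simhash"] ^ asset["simhash"]).count("1") <= 3):
            matched.add("response")
        if target.get("ip") and asset.get("ip") and same_c_segment(target["ip"], asset["ip"]):
            matched.add("ip")
        if matched:
            hits[asset["domain"]] = matched
    return hits


target = {"domain": "target.test", "ssl_serial": "0A1B", "icp_number": "ICP-1",
          "simhash": 0b1111, "ip": "192.0.2.10"}
library = [
    {"domain": "a.test", "ssl_serial": "0A1B", "ip": "192.0.2.77"},
    {"domain": "b.test", "icp_number": "ICP-1", "simhash": 0b1101, "ip": "198.51.100.5"},
    {"domain": "c.test", "ssl_serial": "FFFF", "ip": "203.0.113.9"},
]
print(associate(target, library))
```

Recording which dimensions matched, rather than a plain yes or no, keeps the information that the later vectorization step needs.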
Step S230, performing vectorization processing on the asset data according to the predetermined website information index to obtain a multidimensional vector of the asset data.
The website information indexes may be determined according to the actual application scenario and the web page content of the target website. For example, the website information indexes may be the website title, the web page simhash value, the IP address, the ICP filing number, contact information, SSL certificate information, and the like. After the website information indexes are determined, the asset data is vectorized according to the association relationship, on each website information index, between the target website and the asset data obtained in step S220, so as to obtain the multi-dimensional vector corresponding to the asset data. Specifically, the value of the asset data on each website information index (for example, the website title, web page simhash value, IP address, ICP filing number, contact information, and SSL certificate information) may be determined according to whether the asset data is equal to the target website's data on that index, so as to obtain the multi-dimensional vector of the asset data.
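A minimal sketch of the equality-based vectorization described above follows. The index names, their order, and the encoding of "equal" as 1.0 and "not equal" as 0.0 are assumptions chosen for illustration; the application does not fix a particular encoding.

```python
# Hypothetical index order; each position of the vector corresponds to one index.
INDEXES = ("title", "simhash", "ip", "icp_number", "contact", "ssl_serial")


def vectorize(asset: dict, target: dict, indexes=INDEXES) -> list:
    """Encode an asset as one value per index: 1.0 if it equals the target there, else 0.0."""
    vector = []
    for name in indexes:
        value = asset.get(name)
        equal = value is not None and value == target.get(name)
        vector.append(1.0 if equal else 0.0)
    return vector


target = {"title": "Example Portal", "simhash": 15, "ip": "192.0.2.10",
          "icp_number": "ICP-1", "contact": "admin@target.test", "ssl_serial": "0A1B"}
asset = {"title": "Example Portal", "simhash": 9, "ip": "192.0.2.10",
         "icp_number": None, "ssl_serial": "0A1B"}
print(vectorize(asset, target))
```

A missing value on the asset side is treated as "not equal" here, so an asset is never credited for a dimension on which nothing is known about it.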
Further, the value of the asset data on each website information index, which reflects the association relationship between the asset data and the target website on that index, may be determined according to a preset weight for the index and the association process by which the asset data was acquired. For example, if asset data A is identical to the target website in the website title, the value of asset data A on the website title index is 0. As another example, suppose that asset data A has the same data as the target website on the SSL certificate information index, that the weight of SSL certificate information is m, and that asset data A was obtained from the target website by first associating asset data B according to the web page response information and then associating asset data A according to the IP address information of asset data B; the value of asset data A on the SSL certificate information index may then be set to m². In addition, different website information indexes may differ in importance for the website association relationship. For example, two websites with the same web page simhash value and two websites in the same IP segment have different degrees of intimacy, the latter being higher. The weights set for the website information indexes may therefore also differ.
Vectorizing the asset data to obtain its multi-dimensional vector yields the value of each piece of asset data associated with the target website on each website information index. The abstract association relationship between each piece of asset data and the target website is thereby quantized into specific numerical values, which provide the basic data for subsequently calculating the degree of intimacy between each piece of asset data and the target website. Because the values on the different indexes form a multi-dimensional vector, this approach provides more comprehensive data for the subsequent intimacy determination than existing methods that analyze a single dimension of the target website's information in isolation, and so improves the accuracy of the intimacy calculation.
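The SSL certificate example above can be sketched as a small rule, assuming that the value of a matched index is its weight raised to the number of association steps by which the asset was reached (two steps in the example, via asset data B). That reading, the function names, the example weights, and the `default` value used for indexes without a match are all assumptions; the application does not fix how such indexes are encoded.

```python
def index_value(weight: float, hops: int) -> float:
    """Value of a matched index: the index weight raised to the hop count (assumed rule)."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    return weight ** hops


def weighted_vector(matched_hops: dict, weights: dict, default: float = 0.0) -> list:
    """Build one value per index, in the order of `weights`.

    matched_hops maps an index name to the number of association steps through
    which the asset was reached; indexes absent from it receive `default`.
    """
    return [
        index_value(weights[name], matched_hops[name]) if name in matched_hops else default
        for name in weights
    ]


# Hypothetical weights; an asset reached in two steps that matches on the SSL index.
weights = {"title": 0.5, "ssl": 0.5, "ip": 0.25}
print(weighted_vector({"ssl": 2, "ip": 1}, weights))
```

With a weight below 1, each extra association step shrinks the value, so assets reached indirectly count for less than assets reached directly under this assumption.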
Step S240, performing hierarchical clustering processing on the multi-dimensional vectors to obtain clusters of the asset data, and determining the intimacy degree between the asset data and the target website according to the clusters of the asset data.
Hierarchical clustering is an unsupervised learning algorithm: the similarity between nodes is calculated with a chosen similarity measure, and the nodes are merged step by step in order of similarity from high to low. Applying a hierarchical clustering algorithm to the multi-dimensional vectors yields the distances between the multi-dimensional vectors corresponding to the asset data; the smaller the calculated distance between two pieces of asset data, the higher their similarity. After the distances between the different multi-dimensional vectors are obtained, the two multi-dimensional vectors with the shortest distance are merged, and a clustering tree is generated in this way. Through repeated calculation and merging, the clusters of the asset data are obtained; there may be one or more such clusters.
It will be appreciated that the cluster group to which the target website belongs can also be determined, and the intimacy degree between other asset data and the target website can be analyzed according to the distance between that asset data and the target website within the cluster group. In addition, asset data in other cluster groups may be less closely associated with the target website than asset data in the same cluster group as the target website. Therefore, once the clusters of the asset data are obtained, the intimacy degree between different asset data and the target website can be determined.
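One way to read the intimacy analysis above is to rank the members of the target website's cluster group by their distance to the target. The sketch below assumes that cluster labels are already available, that the target website has its own vector, and that plain Euclidean distance is used; the names and example values are hypothetical.

```python
import math


def rank_by_intimacy(vectors: dict, labels: dict, target: str) -> list:
    """Return the other members of the target's cluster, closest (most intimate) first."""
    cluster = labels[target]
    peers = [name for name in vectors if name != target and labels[name] == cluster]
    return sorted(peers, key=lambda name: math.dist(vectors[name], vectors[target]))


vectors = {
    "target.test": [1.0, 1.0, 1.0],
    "a.test": [1.0, 1.0, 0.0],
    "b.test": [1.0, 0.0, 0.0],
    "c.test": [0.0, 0.0, 0.0],
}
labels = {"target.test": 0, "a.test": 0, "b.test": 0, "c.test": 1}  # hypothetical clustering result
print(rank_by_intimacy(vectors, labels, "target.test"))
```

Assets outside the target's cluster group, such as `c.test` here, are left out of the ranking, in line with the statement that they may be less closely associated with the target website.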
In the above steps S210 to S240, the website information of the target website is obtained, the website information is subjected to multidimensional association analysis to obtain asset data associated with the target website, the asset data is subjected to vectorization processing according to a predetermined website information index to obtain a multidimensional vector of the asset data, the multidimensional vector is subjected to hierarchical clustering processing to obtain a cluster of the asset data, and the degree of intimacy between the asset data and the target website is determined according to the cluster of the asset data. The method and the device realize the clustering of the asset data acquired by different dimensionalities of the website information through hierarchical clustering, further realize the calculation of the degree of closeness between a target website and an associated website, and further improve the accuracy of website association analysis.
Further, in an embodiment, based on the step S220, performing multidimensional association analysis on the website information to obtain asset data associated with the target website, which specifically includes the following steps:
Step S221, extracting multi-dimensional information of the target website from the website information.
Specifically, multidimensional information of the target website, such as the SSL certificate, ICP filing information, web page response information, IP information, domain name information, and the like, is respectively obtained from the website information.
Step S222, performing association analysis on the multi-dimensional information respectively to obtain asset data associated with each piece of dimensional information in the multi-dimensional information.
Specifically, association analysis may be performed on the SSL certificate, ICP filing information, web page response information, IP information, domain name information, and the like of the target website, respectively, to obtain asset data associated with each piece of dimension information in the multi-dimensional information.
The multi-dimensional information of the target website is extracted from the website information, and the multi-dimensional information is subjected to association analysis respectively to obtain asset data associated with each piece of dimensional information in the multi-dimensional information, so that the asset data associated with the multi-dimensional information of the target website can be obtained, more comprehensive associated asset data of the target website can be obtained, and the efficiency of searching for the associated website based on the target website is improved.
Additionally, in an embodiment, based on the step S220, the asset data includes first asset data and second asset data, and the multidimensional association analysis is performed on the website information to obtain the asset data associated with the target website, which specifically includes the following steps:
Step S223, performing multidimensional association analysis on the website information to obtain first asset data directly associated with the website information.
The first asset data is asset data directly associated with a target website, which is obtained by directly performing multidimensional association analysis on website information of the target website.
Step S224, the first asset data is subjected to extended association, and second asset data directly associated with the first asset data is obtained.
The second asset data is obtained by performing operations such as DNS resolution on the first asset data after the first asset data has been obtained; the second asset data is therefore directly associated with the first asset data and indirectly associated with the target website.
By acquiring the first asset data directly associated with the target website and the second asset data directly associated with the first asset data, the efficiency of searching for the associated website based on the target website can be improved.
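The two-stage acquisition of first and second asset data can be sketched as a breadth-first expansion that records how many association steps separate each asset from the target website. The `links` mapping stands in for whatever lookups (asset library queries, DNS resolution) produce the direct associations; it and the domain names are hypothetical.

```python
from collections import deque


def expand_assets(target: str, links: dict, max_hops: int = 2) -> dict:
    """Return {asset: hops}, where 1 marks first asset data and 2 marks second asset data."""
    hops = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the second asset data
        for neighbour in sorted(links.get(node, ())):
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    del hops[target]
    return hops


links = {
    "target.test": {"a.test", "b.test"},
    "a.test": {"c.test", "target.test"},
    "c.test": {"d.test"},
}
print(expand_assets("target.test", links))
```

Keeping the hop count alongside each asset makes it possible to distinguish direct from indirect association when values are later assigned to the asset data.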
Additionally, in an embodiment, based on the step S230, the vectorizing processing is performed on the asset data according to the predetermined website information index to obtain the multidimensional vector of the asset data, which specifically includes the following steps:
Step S231, vectorizing the asset data according to the similarity between the asset data and the website information on each predetermined website information index, to obtain a multidimensional vector of the asset data.
Additionally, in an embodiment, based on the step S240, performing hierarchical clustering processing on the multidimensional vector to obtain a cluster of the asset data, specifically including the following steps:
Step S241, performing hierarchical clustering processing on the multidimensional vectors to obtain distances between the multidimensional vectors.
For example, suppose the multidimensional vectors corresponding to 4 pieces of asset data are X1(a1, b1, c1, d1, e1, f1, g1, h1), X2(a2, b2, c2, d2, e2, f2, g2, h2), X3(a3, b3, c3, d3, e3, f3, g3, h3), and X4(a4, b4, c4, d4, e4, f4, g4, h4), respectively. The distance between the multidimensional vector of each piece of asset data and those of all the other asset data is calculated by the hierarchical clustering algorithm to determine the similarity among the asset data. Specifically, taking the distance between the multidimensional vectors X1 and X2 as an example, the distance D(X1, X2) between X1 and X2 is given by equation (1):
[Equation (1): formula image not reproduced in this text]
wherein the denominator in equation (1) may be equal to the number of dimensions of the multi-dimensional vector.
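The formula of equation (1) is not reproduced in this text; the only stated property is that its denominator may equal the number of dimensions. The sketch below therefore assumes one plausible form, a Euclidean distance whose squared sum is divided by the dimension count before taking the square root. This form is an assumption and may differ from the formula in the application.

```python
import math


def distance(x: list, y: list) -> float:
    """Assumed form of D(X1, X2): sqrt(sum of squared differences / number of dimensions)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.sqrt(squared / len(x))


x1 = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]  # hypothetical 8-dimensional vectors
x2 = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
print(distance(x1, x2))
```

Dividing by the dimension count keeps distances comparable when the number of website information indexes changes, which is one possible motivation for such a denominator.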
step S242, clustering the multidimensional vectors according to the distances between the multidimensional vectors to obtain the clusters of the asset data.
After the distances between the multidimensional vectors, namely D(X1, X2), D(X1, X3), D(X1, X4), D(X2, X3), D(X2, X4) and D(X3, X4), are obtained, clustering starts from the vector combination with the smallest distance: the distances from the other multidimensional vectors to that combination are calculated, a new vector combination is formed according to the smallest of those distances, and the calculation is repeated until the clusters of the asset data are obtained. Taking D(X1, X2) as the minimum distance, the distance from the multidimensional vector X3 to the vector combination (X1, X2) is given by formula (2):
[Formula (2): image BDA0003160921410000082 in the original publication]
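The iterative merging described for step S242 can be sketched as a small agglomerative procedure. Formula (2) is not reproduced in this text, so the distance from a vector to a vector combination is assumed here to be the average of the member-to-member distances (average linkage); the stopping `threshold` and all names are also assumptions made for illustration.

```python
import math
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Sequence

Vector = Sequence[float]


def agglomerate(
    vectors: Dict[str, Vector],
    dist: Callable[[Vector, Vector], float],
    threshold: float,
) -> List[FrozenSet[str]]:
    """Repeatedly merge the two closest clusters until none is within threshold."""
    clusters: List[FrozenSet[str]] = [frozenset([name]) for name in vectors]

    def linkage(c1: FrozenSet[str], c2: FrozenSet[str]) -> float:
        # Assumed average linkage: mean of all member-to-member distances.
        pairs = [(p, q) for p in c1 for q in c2]
        return sum(dist(vectors[p], vectors[q]) for p, q in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        if linkage(a, b) > threshold:
            break
        # Replace the two closest clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


def euclidean(x: Vector, y: Vector) -> float:
    """Plain Euclidean distance, used only to exercise the sketch."""
    return math.dist(x, y)
```

With a small threshold the procedure returns several clusters of closely related assets; with a very large threshold it ends with a single cluster containing every asset.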
Further, in one embodiment, the website information of the target website includes SSL certificate information, ICP filing information, web page response information, and IP information.
In steps S210 to S242 above, the multidimensional information of the target website is extracted from the website information, and association analysis is performed on each item of that information to obtain the asset data associated with each dimension. Multidimensional association analysis of the website information yields first asset data directly associated with the website information, and extended association of the first asset data yields second asset data directly associated with the first asset data. More comprehensive associated asset data of the target website can thus be obtained, which improves the efficiency of searching for associated websites based on the target website. The asset data is vectorized according to the similarity between the asset data and the website information on each predetermined website information index to obtain the multidimensional vector of the asset data, so that the abstract association between the asset data and the target website is quantified as specific numerical values. Hierarchical clustering is then performed on the multidimensional vectors to obtain the distances between them, and the vectors are clustered according to those distances to obtain the clusters of the asset data. The degree of intimacy between the asset data and the target website is thereby calculated on the basis of hierarchical clustering, which improves the accuracy of website association analysis.
The present embodiment is described and illustrated below by means of preferred embodiments.
Fig. 3 is a flowchart of the website information clustering method according to the preferred embodiment. As shown in fig. 3, the method comprises the following steps:
step S310, collecting data of a target website;
step S321, acquiring corresponding asset data, response body information and IP address information according to the SSL certificate of the data of the target website;
step S322, acquiring corresponding asset data, response body information and IP address information according to the ICP record number of the data of the target website;
step S323, performing a simhash calculation on the response result of the target website to obtain corresponding asset data, response body information and IP address information;
step S324, processing the data of the target website by DNS resolution, and obtaining corresponding asset data, response body information and IP address information according to the IP address in the A record returned by the DNS resolution;
step S325, according to the IP address returned by the DNS resolution, obtaining asset data, response body information and IP address information corresponding to the IP addresses in the same C segment (/24 network);
step S326, performing the expansion processing of steps S324 and S325 on the asset data obtained in steps S321, S322, and S323, to further obtain other asset data;
step S330, vectorizing the asset data based on the multi-dimensional indexes;
step S340, analyzing the vector corresponding to the asset data by utilizing hierarchical clustering to obtain one or more clusters;
and step S350, obtaining the intimacy degree of each asset data and the target website based on each cluster.
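The simhash calculation named in step S323 can be sketched as follows. This is a generic 64-bit simhash over word tokens with unit weights; the tokenization, the use of SHA-256 as the per-token hash, and the function names are assumptions, since the steps above do not specify how the response body is fingerprinted.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a simhash fingerprint; bits must be a multiple of 8, at most 256."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming(a: int, b: int) -> int:
    """Count the differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two response bodies whose fingerprints differ in only a few bits are likely near-duplicates, so the Hamming distance can serve as the similarity signal for associating assets with the target website.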
It should be noted that the steps illustrated in the above flows or in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps illustrated or described may be performed in a different order. For example, the execution order of steps S321, S322, and S323 may be interchanged.
The present embodiment further provides a website information clustering device, which is used to implement the foregoing embodiments and preferred embodiments; what has already been described is not repeated here. The terms "module," "unit," "subunit," and the like as used below may refer to a combination of software and/or hardware that implements a predetermined function. Although the device described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 4 is a block diagram of a website information clustering device 40 according to the present embodiment, and as shown in fig. 4, the website information clustering device 40 includes: an acquisition module 42, an association module 44, a vectorization module 46, and a clustering module 48, wherein:
the acquisition module 42 is configured to acquire website information of a target website;
the association module 44 is configured to perform multidimensional association analysis on the website information to obtain asset data associated with the target website;
the vectorization module 46 is configured to perform vectorization processing on the asset data according to a predetermined website information index to obtain a multidimensional vector of the asset data;
and the clustering module 48 is configured to perform hierarchical clustering processing on the multi-dimensional vectors to obtain clusters of the asset data, and to determine the intimacy degree between the asset data and the target website according to the clusters of the asset data.
The website information clustering device 40 obtains website information of a target website, performs multidimensional association analysis on the website information to obtain asset data associated with the target website, performs vectorization processing on the asset data according to a predetermined website information index to obtain a multidimensional vector of the asset data, performs hierarchical clustering processing on the multidimensional vector to obtain clustering of the asset data, and determines the degree of intimacy between the asset data and the target website according to the clustering of the asset data. The method and the device realize the clustering of the asset data acquired by different dimensionalities of the website information through hierarchical clustering, further realize the calculation of the degree of closeness between a target website and an associated website, and further improve the accuracy of website association analysis.
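The way the four modules of device 40 hand data to one another can be sketched as a small container of callables. The class name, the callable signatures, and the `"name"` field of an asset record are hypothetical, and the final step of turning clusters into an intimacy degree is deliberately left out because this passage does not define how it is computed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Set


@dataclass
class WebsiteInfoClusteringDevice:
    """Hypothetical composition of modules 42, 44, 46 and 48."""
    acquire: Callable[[str], dict]                       # acquisition module 42
    associate: Callable[[dict], List[dict]]              # association module 44
    vectorize: Callable[[dict, dict], Sequence[float]]   # vectorization module 46
    cluster: Callable[[Dict[str, Sequence[float]]], List[Set[str]]]  # clustering module 48

    def run(self, target_url: str) -> List[Set[str]]:
        """Run the pipeline and return clusters of asset names."""
        info = self.acquire(target_url)
        assets = self.associate(info)
        vectors = {a["name"]: self.vectorize(a, info) for a in assets}
        return self.cluster(vectors)
```

Keeping each module behind a plain callable mirrors the statement that the modules may be implemented in software, hardware, or a combination, since any implementation with the same interface can be substituted.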
In one embodiment, the association module 44 is further configured to extract multidimensional information of the target website from the website information, and perform association analysis on the multidimensional information respectively to obtain asset data associated with each dimension information in the multidimensional information.
In one embodiment, the association module 44 is further configured to perform multidimensional association analysis on the website information to obtain first asset data directly associated with the website information, and perform extended association on the first asset data to obtain second asset data directly associated with the first asset data.
In one embodiment, the vectorization module 46 is further configured to perform vectorization processing on the asset data according to similarity between the asset data and website information on each predetermined website information index, so as to obtain a multidimensional vector of the asset data.
In one embodiment, the clustering module 48 is further configured to perform hierarchical clustering on the multidimensional vectors to obtain distances between the multidimensional vectors, and perform clustering on the multidimensional vectors according to the distances between the multidimensional vectors to obtain clusters of the asset data.
In one embodiment, the website information of the target website includes SSL certificate information, ICP filing information, web page response information, and IP information.
The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.
There is also provided in this embodiment an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
acquiring website information of a target website;
performing multidimensional association analysis on the website information to obtain asset data associated with a target website;
vectorizing the asset data according to a predetermined website information index to obtain a multi-dimensional vector of the asset data;
and performing hierarchical clustering processing on the multi-dimensional vectors to obtain clusters of the asset data, and determining the intimacy degree of the asset data and the target website according to the clusters of the asset data.
It should be noted that, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and optional implementations, and details are not described again in this embodiment.
In one embodiment, a computer device is provided, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a website information clustering method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a key, a trackball or a touch pad arranged on the housing of the computer device, or an external keyboard, touch pad or mouse.
Those skilled in the art will appreciate that the above-described architecture is merely a block diagram of some of the structures associated with the present solution and does not limit the computing devices to which the present solution applies; a particular computing device may include more or fewer components than those described, may combine certain components, or may have a different arrangement of components.
In addition, in combination with the website information clustering method provided in the foregoing embodiments, a storage medium may also be provided in this embodiment to implement the method. The storage medium has a computer program stored thereon; the computer program, when executed by a processor, implements any one of the website information clustering methods in the above embodiments.
It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to be limiting. All other embodiments, which can be derived by a person skilled in the art from the examples provided herein without any inventive step, shall fall within the scope of protection of the present application.
The drawings are only examples or embodiments of the present application, and those skilled in the art can apply the present application to other similar cases according to the drawings without creative effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.
The term "embodiment" is used herein to mean that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is to be expressly or implicitly understood by one of ordinary skill in the art that the embodiments described in this application may be combined with other embodiments without conflict.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of patent protection. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A website information clustering method is characterized by comprising the following steps:
acquiring website information of a target website;
performing multidimensional association analysis on the website information to obtain asset data associated with the target website;
vectorizing the asset data according to a predetermined website information index to obtain a multi-dimensional vector of the asset data;
and performing hierarchical clustering processing on the multi-dimensional vectors to obtain clusters of the asset data, and determining the intimacy degree of the asset data and the target website according to the clusters of the asset data.
2. The website information clustering method according to claim 1, wherein the performing multidimensional association analysis on the website information to obtain asset data associated with the target website comprises:
extracting multi-dimensional information of the target website from the website information;
and respectively carrying out association analysis on the multi-dimensional information to obtain asset data associated with each dimension information in the multi-dimensional information.
3. The website information clustering method according to claim 1, wherein the asset data includes first asset data and second asset data, and the multidimensional association analysis is performed on the website information to obtain asset data associated with the target website, further comprising:
performing multidimensional association analysis on the website information to obtain first asset data directly associated with the website information;
and carrying out extended association on the first asset data to obtain second asset data directly associated with the first asset data.
4. The website information clustering method according to claim 1, wherein the vectorizing the asset data according to a predetermined website information index to obtain a multidimensional vector of the asset data comprises:
and vectorizing the asset data according to the similarity between the asset data and the website information on the predetermined website information indexes to obtain a multi-dimensional vector of the asset data.
5. The website information clustering method according to claim 1, wherein the performing hierarchical clustering processing on the multidimensional vectors to obtain clusters of the asset data comprises:
performing hierarchical clustering processing on the multi-dimensional vectors to obtain the distance between the multi-dimensional vectors;
and clustering the multi-dimensional vectors according to the distance between the multi-dimensional vectors to obtain the cluster of the asset data.
6. The website information clustering method according to any one of claims 1 to 5, wherein the website information of the target website includes SSL certificate information, ICP filing information, web page response information, and IP information.
7. A website information clustering apparatus, comprising: the device comprises an acquisition module, an association module, a vectorization module and a clustering module, wherein:
the acquisition module is used for acquiring website information of a target website;
the association module is used for carrying out multidimensional association analysis on the website information to obtain asset data associated with the target website;
the vectorization module is used for vectorizing the asset data according to a predetermined website information index to obtain a multi-dimensional vector of the asset data;
and the clustering module is used for carrying out hierarchical clustering processing on the multi-dimensional vectors to obtain clusters of the asset data, and determining the intimacy degree between the asset data and the target website according to the clusters of the asset data.
8. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the computer program to perform the website information clustering method according to any one of claims 1 to 6.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the website information clustering method according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when being executed by a processor, implements the steps of the website information clustering method according to any one of claims 1 to 6.
CN202110791002.7A 2021-07-13 2021-07-13 Website information clustering method and device, electronic device and computer equipment Active CN113468391B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110791002.7A CN113468391B (en) 2021-07-13 2021-07-13 Website information clustering method and device, electronic device and computer equipment

Publications (2)

Publication Number Publication Date
CN113468391A true CN113468391A (en) 2021-10-01
CN113468391B CN113468391B (en) 2024-05-28

Family

ID=77880044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110791002.7A Active CN113468391B (en) 2021-07-13 2021-07-13 Website information clustering method and device, electronic device and computer equipment

Country Status (1)

Country Link
CN (1) CN113468391B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444961A (en) * 2020-03-26 2020-07-24 国家计算机网络与信息安全管理中心黑龙江分中心 Method for judging internet website affiliation through clustering algorithm
CN111897962A (en) * 2020-07-27 2020-11-06 绿盟科技集团股份有限公司 Internet of things asset marking method and device
CN112003857A (en) * 2020-08-20 2020-11-27 深信服科技股份有限公司 Network asset collecting method, device, equipment and storage medium
CN112104656A (en) * 2020-09-16 2020-12-18 杭州安恒信息安全技术有限公司 Network threat data acquisition method, device, equipment and medium
CN112579848A (en) * 2020-12-10 2021-03-30 北京知道创宇信息技术股份有限公司 Website classification method and device, computer equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114615262A (en) * 2022-01-30 2022-06-10 阿里巴巴(中国)有限公司 Network aggregation method, storage medium, processor and system
CN114615262B (en) * 2022-01-30 2024-05-14 阿里巴巴(中国)有限公司 Network aggregation method, storage medium, processor and system
CN114819971A (en) * 2022-04-22 2022-07-29 支付宝(杭州)信息技术有限公司 Wind control method based on multi-dimensional relational data, graph clustering method and device

Also Published As

Publication number Publication date
CN113468391B (en) 2024-05-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant