CN110807487A - Method and device for identifying user based on domain name system flow record data - Google Patents

Publication number
CN110807487A
Authority
CN
China
Prior art keywords
cluster
preset time
vector
target
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911053713.3A
Other languages
Chinese (zh)
Other versions
CN110807487B (en)
Inventor
李丹丹
黄小红
张沛
谢坤
马丰媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201911053713.3A
Publication of CN110807487A
Application granted
Publication of CN110807487B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The embodiment of the invention provides a method and a device for identifying a user based on domain name system (DNS) traffic record data. The method comprises: obtaining network behavior information corresponding to a plurality of internet protocol (IP) addresses in a plurality of preset time periods; constructing feature vectors according to feature information in the network behavior information, thereby obtaining a plurality of feature vectors; clustering all the feature vectors based on the feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results; determining a target clustering result from the candidate clustering results; determining a user identifier corresponding to each cluster in the target clustering result; obtaining target network behavior information of a user to be identified and constructing a target feature vector; and determining the target cluster corresponding to the target feature vector and taking the user identifier corresponding to the target cluster as the user identifier of the user to be identified. Because the feature vectors are constructed from the acquired network behavior information and then clustered, the periodic pattern of each user's network behavior can be determined, and the user can be identified even when the user's IP address changes continuously.

Description

Method and device for identifying user based on domain name system flow record data
Technical Field
The invention relates to the technical field of data mining, in particular to a method and a device for identifying a user based on domain name system flow record data.
Background
At present, network communication is mainly based on IP (Internet Protocol), and almost all IP-based network communication resolves a domain name through the DNS (Domain Name System) in order to reach a destination IP address. A user's network communication therefore generates a large number of DNS log records that reflect the user's access behavior toward websites. This behavior has certain periodic characteristics, which may reveal some of the user's intentions and interests. If the DNS log records for a period of time are obtained and mined, the periodic characteristics of users' network communication behavior can be found, and different users can be identified according to those characteristics.
Data mining mainly comprises three steps: data construction, rule searching, and rule representation. Data construction extracts the required data from a data source and integrates it into a data set for mining. Rule searching finds the rules contained in the data set through data mining. Rule representation presents the discovered rules in a way that users can understand.
Existing methods for identifying users through data mining, such as Bayesian methods, mainly rely on supervised machine learning techniques that require sample construction. These methods need a large number of DNS log records of network communications for each identified user; that is, they require long-term prior tracking of the identified users to accumulate a large number of samples.
However, as more and more mobile devices come into use, together with protocols that can change a user's IP address such as the Dynamic Host Configuration Protocol (DHCP), a user's IP address may change periodically. It is therefore difficult to use a fixed IP address as an identifier to extract the network behavior information of a user of known identity from the DNS log records. Without that information, data mining cannot determine the periodic characteristics of the user's network communication, and the user cannot be identified.
Disclosure of Invention
The embodiment of the invention aims to provide a method and a device for identifying a user based on domain name system traffic record data, so that users can be identified to the greatest extent possible even when the network communication information of a user of known identity cannot be obtained. The specific technical scheme is as follows:
In a first aspect, an embodiment of the present invention provides a method for identifying a user based on domain name system traffic record data, where the method includes:
acquiring network behavior information corresponding to a plurality of internet protocol addresses in a plurality of preset time periods, wherein the internet protocol address used by each user is unchanged in each preset time period;
constructing a feature vector corresponding to each internet protocol address in each preset time period according to feature information in the network behavior information, wherein each feature vector comprises a preset number of dimensions, and each dimension corresponds to a preset domain name;
clustering all the feature vectors based on the plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results;
determining a target clustering result from the plurality of candidate clustering results based on a preset rule;
determining a user identifier corresponding to each cluster in the target clustering result;
acquiring target network behavior information of a user to be identified, and determining a target feature vector corresponding to the target network behavior information;
and determining a target cluster in the target clustering result corresponding to the target feature vector, and determining the user identifier corresponding to the target cluster as the user identifier of the user to be identified.
Optionally, the step of constructing a feature vector corresponding to each internet protocol address according to the feature information in the network behavior information includes:
determining access information of a preset domain name contained in the network behavior information as characteristic information;
and determining, based on the access information, the number of accesses or the access frequency of each preset domain name for each internet protocol address, and taking the number of accesses or the access frequency as the value of the corresponding dimension of the feature vector, to obtain the feature vector corresponding to each internet protocol address.
Optionally, the step of performing clustering processing on all the feature vectors based on the plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results includes:
taking the plurality of feature vectors corresponding to the first preset time period as a plurality of initial centroids, and calculating the distance from every feature vector to each initial centroid;
traversing all the feature vectors in ascending order of the distance between each feature vector and each initial centroid, and assigning each feature vector to the cluster of its closest initial centroid; when the number of feature vectors in that cluster has reached a maximum limit value, assigning subsequent feature vectors to the cluster of the next closest initial centroid, until all the feature vectors have been assigned and a plurality of clusters is obtained, wherein the maximum limit value is the number of preset time periods;
based on the plurality of clusters, traversing the feature vectors in descending order of the distance between each feature vector and the initial centroid of the cluster in which it is located; for each feature vector, judging whether the cluster of the initial centroid at minimum distance from the feature vector is the cluster in which the feature vector is currently located; if not, judging whether the variance of the two clusters involved would be reduced by exchanging the feature vector with the feature vector in the minimum-distance cluster that is farthest from that cluster's initial centroid;
if so, exchanging the feature vector with the feature vector in the minimum-distance cluster that is farthest from that cluster's initial centroid, until no exchangeable feature vector remains or the number of exchanges reaches a preset maximum number of exchanges, to obtain a candidate clustering result;
determining a plurality of initial centroids for clustering processing based on the feature vectors of the plurality of internet protocol addresses in the next preset time period, wherein the number of these initial centroids is equal to the number of initial centroids corresponding to the first preset time period;
and returning to the step of calculating the distance from every feature vector to each initial centroid, each return yielding the candidate clustering result corresponding to the respective preset time period, until no next preset time period exists, to obtain a plurality of candidate clustering results.
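The size-limited assignment and the exchange step above can be illustrated with a minimal sketch. It is only one possible reading of the steps, not the embodiment's implementation: the function names `sq_dist`, `constrained_assign` and `refine_by_swaps` are hypothetical, the distance is assumed to be Euclidean, and the "variance of the two clusters" is approximated here by the summed squared distance of the two exchanged vectors to the initial centroids.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def constrained_assign(vectors, centroids, capacity):
    """Greedy size-limited assignment.

    Visits (vector, centroid) pairs in ascending order of distance and places
    each vector in the closest cluster that still has room. Assumes
    capacity * len(centroids) >= len(vectors), so every vector is placed.
    """
    pairs = sorted(
        (sq_dist(v, c), i, k)
        for i, v in enumerate(vectors)
        for k, c in enumerate(centroids)
    )
    assignment, sizes = {}, [0] * len(centroids)
    for _, i, k in pairs:
        if i not in assignment and sizes[k] < capacity:
            assignment[i] = k
            sizes[k] += 1
    return assignment


def refine_by_swaps(vectors, centroids, assignment, max_swaps=100):
    """Exchange misplaced vectors with the farthest member of their nearest
    cluster whenever the exchange lowers the summed squared distance."""
    swaps, improved = 0, True
    while improved and swaps < max_swaps:
        improved = False
        # Visit the vectors farthest from their own initial centroid first.
        order = sorted(
            assignment,
            key=lambda i: sq_dist(vectors[i], centroids[assignment[i]]),
            reverse=True,
        )
        for i in order:
            own = assignment[i]
            nearest = min(
                range(len(centroids)),
                key=lambda k: sq_dist(vectors[i], centroids[k]),
            )
            if nearest == own:
                continue
            members = [j for j, k in assignment.items() if k == nearest]
            if not members:
                continue
            j = max(members, key=lambda m: sq_dist(vectors[m], centroids[nearest]))
            before = sq_dist(vectors[i], centroids[own]) + sq_dist(vectors[j], centroids[nearest])
            after = sq_dist(vectors[i], centroids[nearest]) + sq_dist(vectors[j], centroids[own])
            if after < before:
                assignment[i], assignment[j] = nearest, own
                swaps += 1
                improved = True
                if swaps >= max_swaps:
                    break
    return assignment


# Made-up data: two IP addresses observed over two preset time periods.
vectors = [[5, 0], [0, 5], [4, 1], [1, 4]]
centroids = vectors[:2]  # feature vectors of the first preset time period
clusters = refine_by_swaps(
    vectors, centroids, constrained_assign(vectors, centroids, capacity=2)
)
print(clusters)  # maps vector index -> cluster index
```

Running the same two functions once per preset time period, each time with that period's feature vectors as `centroids`, would give one candidate clustering result per period.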
Optionally, the step of determining a target clustering result from the plurality of candidate clustering results based on a preset rule includes:
respectively calculating the average variance of the plurality of candidate clustering results;
and comparing the average variances, and determining the candidate clustering result with the minimum average variance as a target clustering result.
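The selection rule above can be sketched as follows. The sketch assumes that the variance of a cluster is the mean squared distance of its members to their mean vector and that every cluster is non-empty; the text here does not fix the exact variance formula, and the names `cluster_variance`, `average_variance` and `select_target` are hypothetical.

```python
def cluster_variance(members):
    """Mean squared distance of the members to their own mean vector."""
    dim = len(members[0])
    mean = [sum(v[d] for v in members) / len(members) for d in range(dim)]
    return sum(
        sum((v[d] - mean[d]) ** 2 for d in range(dim)) for v in members
    ) / len(members)


def average_variance(clusters):
    """Average of the per-cluster variances of one candidate clustering result."""
    return sum(cluster_variance(c) for c in clusters) / len(clusters)


def select_target(candidates):
    """Return the candidate clustering result with the smallest average variance."""
    return min(candidates, key=average_variance)


# Two made-up candidate results over the same four feature vectors.
candidate_a = [[[5, 0], [4, 1]], [[0, 5], [1, 4]]]
candidate_b = [[[5, 0], [0, 5]], [[4, 1], [1, 4]]]
target = select_target([candidate_a, candidate_b])
print(average_variance(candidate_a), average_variance(candidate_b))
```

In this example `candidate_a` groups similar vectors together and therefore has the smaller average variance, so it is selected.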
Optionally, before the step of performing clustering processing on all the feature vectors based on the plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results, the method further includes:
normalizing each feature vector to obtain a plurality of processed feature vectors;
the step of clustering all the feature vectors based on the plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results comprises:
clustering all the feature vectors based on the processed feature vectors corresponding to each preset time period to obtain a plurality of candidate clustering results.
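The text above does not state which normalization is used. As one possible illustration only, the sketch below assumes that each feature vector is scaled so that its entries sum to 1, which turns access counts into proportions and makes heavy and light users comparable; the function name `normalize` is hypothetical.

```python
def normalize(vector):
    """Scale a non-negative count vector so that its entries sum to 1.

    An all-zero vector (no preset domain accessed) is returned unchanged.
    """
    total = sum(vector)
    if total == 0:
        return list(vector)
    return [x / total for x in vector]


print(normalize([2, 1, 1]))
```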
In a second aspect, an embodiment of the present invention provides an apparatus for identifying a user based on domain name system traffic record data, where the apparatus includes:
the information acquisition module is used for acquiring network behavior information corresponding to a plurality of internet protocol addresses in a plurality of preset time periods, wherein the internet protocol address used by each user is unchanged in each preset time period;
the vector construction module is used for constructing a corresponding feature vector of each internet protocol address in each preset time period according to feature information in the network behavior information to obtain a plurality of feature vectors, wherein each feature vector comprises a preset number of dimensions, and each dimension corresponds to a preset domain name;
the vector processing module is used for clustering all the feature vectors based on the plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results;
the result determining module is used for determining a target clustering result from the candidate clustering results based on a preset rule;
an identifier determining module, configured to determine a user identifier corresponding to each cluster in the target clustering result;
the vector determination module is used for acquiring target network behavior information of a user to be identified and determining a target feature vector corresponding to the target network behavior information;
and the user identification module is used for determining a target cluster in the target clustering result corresponding to the target feature vector and determining the user identifier corresponding to the target cluster as the user identifier of the user to be identified.
Optionally, the vector building module includes:
the information determining submodule is used for determining that the access information of the preset domain name contained in the network behavior information is characteristic information;
and the vector construction submodule is used for determining, based on the access information, the number of accesses or the access frequency of each preset domain name for each internet protocol address, and taking the number of accesses or the access frequency as the value of the corresponding dimension of the feature vector, to obtain the feature vector corresponding to each internet protocol address.
Optionally, the vector processing module includes:
the calculation submodule is used for taking the plurality of feature vectors corresponding to the first preset time period as a plurality of initial centroids and calculating the distance from every feature vector to each initial centroid;
the vector dividing submodule is used for traversing all the feature vectors in ascending order of the distance between each feature vector and each initial centroid, and assigning each feature vector to the cluster of its closest initial centroid; when the number of feature vectors in that cluster has reached a maximum limit value, subsequent feature vectors are assigned to the cluster of the next closest initial centroid, until all the feature vectors have been assigned and a plurality of clusters is obtained, wherein the maximum limit value is the number of preset time periods;
the judging submodule is used for traversing, based on the plurality of clusters, the feature vectors in descending order of the distance between each feature vector and the initial centroid of the cluster in which it is located; for each feature vector, judging whether the cluster of the initial centroid at minimum distance from the feature vector is the cluster in which the feature vector is currently located; and, if not, judging whether the variance of the two clusters involved would be reduced by exchanging the feature vector with the feature vector in the minimum-distance cluster that is farthest from that cluster's initial centroid;
the vector processing submodule is used for, if the variance would be reduced, exchanging the feature vector with the feature vector in the minimum-distance cluster that is farthest from that cluster's initial centroid, until no exchangeable feature vector remains or the number of exchanges reaches a preset maximum number of exchanges, to obtain a candidate clustering result;
an initial centroid determining submodule, configured to determine a plurality of initial centroids for clustering processing based on the feature vectors of the plurality of internet protocol addresses in the next preset time period, where the number of these initial centroids is equal to the number of initial centroids corresponding to the first preset time period;
and the circulation submodule is used for triggering the calculation submodule to calculate the distance from every feature vector to each initial centroid, each trigger yielding the candidate clustering result corresponding to the respective preset time period, until no next preset time period exists, to obtain a plurality of candidate clustering results.
Optionally, the result determining module includes:
the variance calculation submodule is used for calculating the average variance of the candidate clustering results respectively;
and the result determining submodule is used for comparing the average variances and determining the candidate clustering result with the minimum average variance as the target clustering result.
Optionally, the apparatus further comprises:
the normalization processing module is used for performing normalization processing on each feature vector to obtain a plurality of processed feature vectors before the vector processing module performs clustering processing on all feature vectors based on a plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results;
the vector processing module comprises:
and the clustering processing submodule is used for clustering all the feature vectors based on the plurality of processed feature vectors corresponding to each preset time period to obtain a plurality of candidate clustering results.
The embodiment of the invention provides a method for identifying a user based on domain name system traffic record data, which comprises: acquiring network behavior information corresponding to a plurality of internet protocol addresses in a plurality of preset time periods; constructing, according to feature information in the network behavior information, a feature vector corresponding to each internet protocol address in each preset time period, to obtain a plurality of feature vectors; clustering all the feature vectors based on the plurality of feature vectors obtained in each preset time period, to obtain a plurality of candidate clustering results; determining a target clustering result from the plurality of candidate clustering results based on a preset rule; determining a user identifier corresponding to each cluster in the target clustering result; acquiring target network behavior information of a user to be identified, and determining a target feature vector corresponding to the target network behavior information; and determining a target cluster in the target clustering result corresponding to the target feature vector, and determining the user identifier corresponding to the target cluster as the user identifier of the user to be identified. In each preset time period, the internet protocol address used by each user is unchanged; each feature vector comprises a preset number of dimensions, and each dimension corresponds to a preset domain name.
The scheme provided by the embodiment of the invention constructs feature vectors from the acquired network behavior information, and each feature vector represents the characteristics of a user's network behavior. Clustering the feature vectors therefore groups vectors with similar patterns into one class, that is, it finds the periodic pattern of each user's network behavior. The user identifier corresponding to each cluster can then be determined from the target clustering result, so that users can be identified to the greatest extent possible even when their IP addresses change continuously.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for identifying a user based on domain name system traffic record data according to an embodiment of the present invention.
Fig. 2 is a specific flowchart of step S102 in the embodiment shown in fig. 1.
Fig. 3 is a specific flowchart of step S103 in the embodiment shown in fig. 1.
Fig. 4 is a schematic structural diagram of an apparatus for identifying a user based on domain name system traffic record data according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the related data mining technology, a large number of DNS log records of the network communication of a user of known identity must be obtained in advance. In network communication, however, many measures that change a user's IP address are adopted to protect user privacy, so it is very difficult to pick out a particular user from a large number of DNS log records.
In order to solve the problem, embodiments of the present invention provide a method and an apparatus for identifying a user based on domain name system traffic record data, an electronic device, and a computer-readable storage medium. First, a method for identifying a user based on domain name system traffic record data according to an embodiment of the present invention is described below.
The method for identifying a user based on domain name system traffic record data provided by the embodiment of the invention can be applied to any electronic device that needs to identify users, for example a server, which is not specifically limited herein. For clarity of description, it is hereinafter referred to as the electronic device.
As shown in Fig. 1, the method for identifying a user based on domain name system traffic record data comprises:
S101, acquiring network behavior information corresponding to a plurality of internet protocol addresses in a plurality of preset time periods.
In each preset time period, the internet protocol address used by each user is unchanged.
S102, constructing, according to the feature information in the network behavior information, a feature vector corresponding to each internet protocol address in each preset time period, to obtain a plurality of feature vectors.
Each feature vector comprises a preset number of dimensions, and each dimension corresponds to a preset domain name.
S103, clustering all the feature vectors based on the plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results.
S104, determining a target clustering result from the candidate clustering results based on a preset rule.
S105, determining a user identifier corresponding to each cluster in the target clustering result.
S106, acquiring target network behavior information of the user to be identified, and determining a target feature vector corresponding to the target network behavior information.
S107, determining a target cluster in the target clustering result corresponding to the target feature vector, and determining the user identifier corresponding to the target cluster as the user identifier of the user to be identified.
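The final lookup in steps S106 and S107 can be sketched as follows. The steps above do not state how the target cluster is found; this sketch assumes it is the cluster whose centroid is nearest to the target feature vector in Euclidean distance. The function name `identify_user` and the identifiers `user_A` and `user_B` are hypothetical.

```python
def identify_user(target_vector, cluster_centroids, cluster_user_ids):
    """Return the user identifier of the cluster whose centroid is nearest
    to the target feature vector (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best = min(
        range(len(cluster_centroids)),
        key=lambda k: sq_dist(target_vector, cluster_centroids[k]),
    )
    return cluster_user_ids[best]


# Made-up target clustering result with two clusters.
centroids = [[4.5, 0.5], [0.5, 4.5]]
user_ids = ["user_A", "user_B"]
print(identify_user([4, 0], centroids, user_ids))
```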
It can be seen that, in the solution provided in the embodiment of the present invention, the electronic device may: obtain network behavior information corresponding to a plurality of internet protocol addresses in a plurality of preset time periods, where the internet protocol address used by each user is unchanged in each preset time period; construct, according to the feature information in the network behavior information, a feature vector corresponding to each internet protocol address in each preset time period, to obtain a plurality of feature vectors; cluster all the feature vectors based on the plurality of feature vectors obtained in each preset time period, to obtain a plurality of candidate clustering results; determine a target clustering result from the plurality of candidate clustering results based on a preset rule; determine a user identifier corresponding to each cluster in the target clustering result; obtain target network behavior information of the user to be identified, and determine a target feature vector corresponding to the target network behavior information; and determine a target cluster in the target clustering result corresponding to the target feature vector, and determine the user identifier corresponding to the target cluster as the user identifier of the user to be identified.
The method provided by the embodiment of the invention constructs feature vectors from the acquired network behavior information, and each feature vector represents the characteristics of a user's network behavior. Clustering the feature vectors therefore groups vectors with similar patterns into one class, that is, it finds the periodic pattern of each user's network behavior. The user identifier corresponding to each cluster can then be determined from the target clustering result, so that users can be identified to the greatest extent possible even when their IP addresses change continuously.
In step S101, the electronic device may acquire network behavior information corresponding to a plurality of internet protocol addresses in a plurality of preset time periods. For convenience of subsequent processing, the internet protocol address used by each user is not changed in each preset time period, that is, in each preset time period, it can be determined that the network behavior information corresponding to each internet protocol address is generated by the network communication behavior of the same user.
The preset time period may be determined according to actual conditions, and may be, for example, one day, 8 hours, three days, and the like, which is not specifically limited herein.
The network behavior information can be information generated by the network communication behavior of users, such as DNS network logs. In one embodiment, the network behavior information may be a traffic log obtained from a DNS server through an ELK (Elasticsearch, Logstash, Kibana; an open-source distributed search and log-analysis tool stack) port. Since the DNS server performs domain name resolution, the traffic log may include the plurality of internet protocol addresses, domain names, network communication data related to the domain names, and the like, which is not specifically limited herein.
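As a minimal sketch of how such log records might be organized for the later steps, the example below groups queried domain names by preset time period and source IP address. The record layout (ISO timestamp, source IP address, queried domain) is an assumption made for illustration, as is the choice of one calendar day as the preset time period; the function name `group_by_period_and_ip` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def group_by_period_and_ip(records):
    """Map (day, ip) -> list of queried domains, one calendar day per period."""
    grouped = defaultdict(list)
    for timestamp, ip, domain in records:
        day = datetime.fromisoformat(timestamp).date()
        grouped[(day, ip)].append(domain)
    return dict(grouped)


# Made-up DNS log records: (timestamp, source IP address, queried domain).
records = [
    ("2019-10-01T08:00:00", "10.0.0.1", "example.com"),
    ("2019-10-01T09:30:00", "10.0.0.1", "example.org"),
    ("2019-10-02T08:05:00", "10.0.0.7", "example.com"),
]
grouped = group_by_period_and_ip(records)
for key, domains in grouped.items():
    print(key, domains)
```

Each (period, IP address) group then corresponds to one feature vector in step S102.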
In step S102, the electronic device may construct a feature vector corresponding to each internet protocol address in each preset time period according to the feature information in the network behavior information. Since each IP address may access one or more domain names one or more times within a preset time period, feature information may first be determined from the network behavior information to simplify processing, and a feature vector corresponding to each IP address may then be constructed from the feature information.
The feature vectors may correspond one-to-one to the IP addresses, so that each feature vector characterizes the network behavior information of one IP address. The electronic device may determine a plurality of preset domain names in advance, which may be common and/or representative domain names. Each dimension of the feature vector may correspond to one preset domain name and represent the feature information associated with that preset domain name.
Thus, once the feature information is determined, the feature vector may be constructed from the network communication data associated with the domain names. In one embodiment, if an IP address does not access a given preset domain name within a preset time period, the dimension corresponding to that preset domain name may be 0 or null; if the IP address does access the preset domain name within the preset time period, the dimension corresponding to that preset domain name may hold information that represents the access behavior.
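A minimal sketch of this construction, assuming the number of accesses is used as the value of each dimension: the preset domain names listed in `PRESET_DOMAINS` and the function name `build_feature_vector` are hypothetical, since the text does not name specific domains.

```python
from collections import Counter

# Hypothetical preset domain names; one dimension of the vector per entry.
PRESET_DOMAINS = ["example.com", "example.org", "example.net"]


def build_feature_vector(queried_domains, preset_domains=PRESET_DOMAINS):
    """Return one access count per preset domain; 0 if it was never accessed.

    Domains outside the preset list are ignored.
    """
    counts = Counter(queried_domains)
    return [counts.get(domain, 0) for domain in preset_domains]


# Queries from one IP address in one preset time period (made-up data).
queries = ["example.com", "example.org", "example.com", "other.test"]
vector = build_feature_vector(queries)
print(vector)
```

Dividing each count by the length of the preset time period would give an access frequency instead of a count.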
After obtaining the plurality of feature vectors, the electronic device may execute step S103, that is, perform clustering processing on all feature vectors based on the plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results.
In clustering, the choice of initial centroids can greatly affect the final clustering result. If only the feature vectors corresponding to one preset time period were used as the initial centroids, the error of the obtained clustering result would likely be large. To ensure the accuracy of the clustering result, the feature vectors obtained in each preset time period may in turn be used as the initial centroids, with all feature vectors as the objects to be clustered. In this way, clustering based on the feature vectors obtained in different preset time periods yields different candidate clustering results, and each candidate clustering result contains the same number of clusters as there are initial centroids.
In the clustering result obtained by a clustering algorithm, each cluster corresponds to one category, and the information included in each cluster belongs to the same category, hence the name clustering algorithm. In the embodiment of the present invention, the feature vectors included in each cluster of a candidate clustering result are feature vectors of the same category having similar features.
In one embodiment, each cluster of each candidate clustering result includes a plurality of feature vectors, which may correspond to different internet protocol addresses or to the same internet protocol address. Even if the feature vectors in a cluster correspond to different internet protocol addresses, they belong to the same category and have similar features, so the corresponding users are highly likely to be the same user whose internet protocol address has changed regularly.
After obtaining the candidate clustering results, the step S104 may be executed, that is, one of the candidate clustering results is determined as a final result, i.e., the target clustering result, based on a preset rule.
Since each candidate clustering result is obtained by clustering all the feature vectors under the condition that the initial centroids are different, and the accuracy of a plurality of candidate clustering results is different, in order to more accurately identify the user, the electronic device can determine a target clustering result with the highest accuracy from the plurality of candidate clustering results.
In an embodiment, the preset rule may be a variance minimization rule of the clustering result, and specifically, the electronic device may calculate variances of the multiple candidate clustering results, and further use a candidate clustering result with the smallest variance as the target clustering result, so that a difference between feature vectors in each cluster of the obtained target clustering result is small, and the user can be identified more accurately.
In step S105, in order to facilitate subsequent user identification, the electronic device may determine a user identifier corresponding to each cluster in the target clustering result. In this step, the electronic device may set a user identifier for the user corresponding to each cluster, so that when the electronic device subsequently obtains the network behavior information corresponding to a certain internet protocol address, the user generating the network behavior information may be identified by the user identifier.
The user identifier may be any identifier capable of uniquely identifying a user, and may consist of characters such as letters or numbers, or a combination of at least two kinds of characters, which is not specifically limited herein. For example, if the target clustering result includes 5 clusters, namely cluster 1 to cluster 5, the electronic device may determine that clusters 1 to 5 respectively correspond to user identifier a, user identifier b, user identifier c, user identifier d, and user identifier e. For another example, if the target clustering result includes 10 clusters, the electronic device may determine that the user identifiers corresponding to the 10 clusters are 01 to 10, or A to J, and the like.
After the user identifier is determined, when the network behavior information of the user to be identified is acquired, the electronic device may construct a target feature vector according to the feature information therein, that is, execute step S106. The specific manner of constructing the target feature vector may be the same as that of constructing the feature vector described above.
Furthermore, in step S107, the electronic device may determine a target cluster corresponding to the target feature vector, and determine a user identifier corresponding to the target cluster as the user identifier of the user to be identified, so as to also implement user identification.
In an implementation manner, the electronic device may calculate a similarity between the target feature vector and each cluster included in the target clustering result, and determine a cluster with the highest similarity as a target cluster, where a user identifier corresponding to the target cluster is a user identifier of a user to be identified.
For example, the target clustering result includes 5 clusters, which are respectively a cluster 1 to a cluster 5, and respectively correspond to the user identifier a, the user identifier b, the user identifier c, the user identifier d, and the user identifier e. Then if the electronic device determines that the cluster 4 has the highest similarity to the target feature vector, it may determine that the user identifier of the user to be identified is the user identifier d.
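The matching in step S107 can be sketched as follows. The text does not fix a similarity measure, so this minimal sketch assumes that similarity is the negative Euclidean distance between the target feature vector and the mean of each cluster; the function names `cluster_mean` and `identify_user` are hypothetical.

```python
import math


def cluster_mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def identify_user(target_vector, clusters, user_ids):
    """Return the user identifier of the cluster most similar to target_vector.

    clusters: list of clusters, each a non-empty list of feature vectors.
    user_ids: one identifier per cluster, in the same order.
    Assumption: the most similar cluster is the one whose mean has the
    smallest Euclidean distance to the target feature vector.
    """
    best_index = min(
        range(len(clusters)),
        key=lambda i: math.dist(target_vector, cluster_mean(clusters[i])),
    )
    return user_ids[best_index]
```

Other similarity measures, such as cosine similarity, could be substituted in the `key` function without changing the rest of the sketch.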
As an implementation manner of the embodiment of the present invention, as shown in fig. 2, the step S102 may include:
S201, determining access information of a preset domain name included in the network behavior information as feature information.
Because each user has different network communication habits, the websites each user visits every day differ, and the access information for the preset domain names can represent a user's network communication habits. The access information for the preset domain names can therefore be used as the feature information.
The access information for the domain name may include a source internet protocol address, a domain name, an access time, and the like, which is not specifically limited herein.
In order to facilitate the use of the feature information in the subsequent processing, the feature information may be stored after being acquired. For example, each kind of information included in the feature information a may be indexed and stored in the database in a manner shown in the following table:
As can be seen from the above table, the feature information a includes a source IP address, an access domain name, and an access timestamp. The electronic device may define the field names, storage mode, and index mode by means of a database language, so that when the feature information is used later, the relevant feature information can be looked up in the database. Of course, the feature information may also be stored in other manners, which is not specifically limited herein.
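One possible way to store and index such feature information is sketched below with Python's built-in `sqlite3` module. The text does not name a database, so the choice of SQLite, the table name `dns_access`, the column names, and the helper functions are all assumptions made for illustration.

```python
import sqlite3


def create_feature_store(conn):
    """Create a table for the three fields named in the text, plus an index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dns_access ("
        " source_ip TEXT NOT NULL,"
        " domain TEXT NOT NULL,"
        " access_ts INTEGER NOT NULL)"
    )
    # Index on (source_ip, access_ts) so per-address, per-period lookups are cheap.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_ip_ts ON dns_access (source_ip, access_ts)"
    )


def add_access(conn, source_ip, domain, access_ts):
    """Store one access record."""
    conn.execute(
        "INSERT INTO dns_access VALUES (?, ?, ?)", (source_ip, domain, access_ts)
    )


def accesses_in_period(conn, source_ip, start_ts, end_ts):
    """Return (domain, access_ts) rows for one address in [start_ts, end_ts)."""
    rows = conn.execute(
        "SELECT domain, access_ts FROM dns_access"
        " WHERE source_ip = ? AND access_ts >= ? AND access_ts < ?"
        " ORDER BY access_ts",
        (source_ip, start_ts, end_ts),
    )
    return rows.fetchall()
```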
S202, determining the access frequency or the number of accesses of each preset domain name corresponding to each internet protocol address based on the access information, and taking the access frequency or the number of accesses as a dimension of the feature vector corresponding to the internet protocol address, so as to obtain the feature vector corresponding to each internet protocol address.
The access frequency or number of accesses of each preset domain name corresponding to each internet protocol address can represent the user's habitual access behavior toward each preset domain name. Therefore, after the feature information is determined, the electronic device may determine the access times of each preset domain name corresponding to each internet protocol address, determine from them the access frequency or number of accesses of each preset domain name corresponding to each internet protocol address within a preset time period, and use that frequency or number as the dimension corresponding to the preset domain name in the feature vector.
As an embodiment, the plurality of feature vectors obtained within the preset time period may be represented by a matrix, for example, the following matrix:
[Matrix image BDA0002255969730000122 from the original publication]
Each row in the matrix represents the feature vector of one internet protocol address, each column corresponds to one preset domain name, and each element represents the access frequency or number of accesses of that internet protocol address to that preset domain name.
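A minimal sketch of steps S201 and S202 for one preset time period is given here, assuming the access information is available as (source address, domain name) pairs and that each dimension holds the number of accesses. The function name `build_feature_matrix` and the record layout are hypothetical.

```python
from collections import Counter


def build_feature_matrix(records, preset_domains):
    """Build one count-based feature vector per internet protocol address.

    records: iterable of (source_ip, domain) pairs for one preset time period.
    preset_domains: ordered list of preset domain names (one per dimension).
    Returns (ips, matrix), where matrix[i][j] is the number of accesses
    from ips[i] to preset_domains[j]; domains not visited stay at 0.
    """
    records = list(records)
    wanted = set(preset_domains)
    # Only accesses to preset domain names count as feature information.
    counts = Counter((ip, domain) for ip, domain in records if domain in wanted)
    ips = sorted({ip for ip, _ in records})
    matrix = [[counts[(ip, domain)] for domain in preset_domains] for ip in ips]
    return ips, matrix
```

Dividing each count by the length of the preset time period would give an access frequency instead of a number of accesses.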
As can be seen, in the scheme provided in the embodiment of the present invention, the electronic device may determine that access information of a preset domain name included in the network behavior information is feature information, determine, based on the access information, an access frequency or a number of times of each preset domain name corresponding to each internet protocol address, and obtain, as a dimension of a feature vector corresponding to the internet protocol address, the feature vector corresponding to each internet protocol address. In this way, the electronic device can use the access frequency or the number of times of each preset domain name corresponding to each internet protocol address as the dimension of the feature vector corresponding to the internet protocol address, and can accurately represent the periodic features of the network behavior performed by the user, which are embodied by the network behavior information.
As an implementation manner of the embodiment of the present invention, as shown in fig. 3, the step S103 may include:
S301, taking a plurality of feature vectors corresponding to the first preset time period as a plurality of initial centroids, and calculating the distance from all the feature vectors to each initial centroid.
For each preset time period, the electronic device may determine one feature vector for each internet protocol address; that is, after step S101 has been performed multiple times, the electronic device may obtain multiple feature vectors corresponding to each internet protocol address. In order to pick out, among all the feature vectors, those corresponding to the network behavior information of the same user, the electronic device may process the feature vectors by clustering.
In this embodiment, the electronic device may use the plurality of feature vectors corresponding to the first preset time period as the initial centroids of the clusters, and calculate the distances from all the feature vectors to each of the initial centroids. For a given initial centroid, "all the feature vectors" means all feature vectors other than that initial centroid itself.
The algorithm used in the clustering process may be a K-means clustering algorithm, a mean shift clustering algorithm, or a Gaussian algorithm, and is not particularly limited herein.
S302, traversing all the feature vectors and assigning each feature vector to the cluster of its nearest initial centroid, considering the initial centroids in ascending order of their distance from the feature vector; when the number of feature vectors in the cluster of the nearest initial centroid has reached the maximum limit value, assigning the feature vector to the cluster of the next-nearest initial centroid, until all the feature vectors have been assigned, so as to obtain a plurality of clusters.
Wherein the maximum limit value is the number of the preset time periods.
Only one feature vector can be constructed from the network behavior information corresponding to each internet protocol address in each preset time period. If an internet protocol address has no network access in a preset time period, the network behavior information acquired for that internet protocol address in that period is empty, and no feature vector can be constructed. That is, the number of feature vectors corresponding to each internet protocol address cannot exceed the number of preset time periods, so the number of preset time periods may be taken as the maximum limit value.
After calculating the distance from each feature vector to the initial centroid, the electronic device may sequentially divide each feature vector into clusters where the initial centroid closest to the feature vector is located according to the distance between each feature vector and the initial centroid.
For example, suppose the initial centroids include initial centroid j1, initial centroid j2 and initial centroid j3, and for a feature vector t the distance to initial centroid j2 is the smallest, the distance to initial centroid j3 is the second smallest, and the distance to initial centroid j1 is the largest. When assigning the feature vector t, if the number of feature vectors in the cluster of initial centroid j2 has not reached the maximum limit value, the feature vector t is assigned to the cluster of initial centroid j2. If that cluster has reached the maximum limit value, the feature vector t is assigned to the cluster of initial centroid j3; and if the cluster of initial centroid j3 has also reached the maximum limit value, the feature vector t is assigned to the cluster of initial centroid j1.
Since the total number of feature vectors is fixed, if one cluster contains too many feature vectors, other clusters will inevitably contain very few, which harms the accuracy of the clustering result. The number of feature vectors in each cluster is therefore kept as equal as possible; that is, the number of feature vectors in each cluster must not exceed the maximum limit value.
Therefore, when assigning the feature vectors, once the number of feature vectors in a cluster reaches the maximum limit value, the electronic device no longer assigns feature vectors to that cluster and continues to assign the remaining feature vectors to the clusters of their nearest initial centroids that still have room, until all the feature vectors have been assigned. The number of clusters obtained is thus the same as the number of initial centroids.
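The capacity-limited assignment of steps S301 and S302 can be sketched as follows. This is a simplified illustration, not the patented procedure itself: it assumes Euclidean distance, visits the feature vectors in the order given, and lets a vector that serves as an initial centroid be assigned like any other vector (it is at distance 0 from itself). The function name `assign_with_capacity` is hypothetical.

```python
import math


def assign_with_capacity(vectors, centroids, max_per_cluster):
    """Assign each vector to its nearest initial centroid that still has room.

    vectors: list of feature vectors (all preset time periods together).
    centroids: list of initial centroids.
    max_per_cluster: maximum limit value (number of preset time periods).
    Returns a list of clusters, each a list of indices into vectors.
    """
    clusters = [[] for _ in centroids]
    for index, vector in enumerate(vectors):
        # Initial centroids ordered from nearest to farthest for this vector.
        order = sorted(
            range(len(centroids)),
            key=lambda c: math.dist(vector, centroids[c]),
        )
        for c in order:
            if len(clusters[c]) < max_per_cluster:
                clusters[c].append(index)
                break
        else:
            raise ValueError("all clusters are full; raise max_per_cluster")
    return clusters
```

With two centroids and a limit of 2, a third vector that is nearest to an already full cluster falls through to the next-nearest cluster, as in the j1, j2, j3 example above.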
S303, based on the plurality of clusters, traversing the feature vectors in descending order of the distance between each feature vector and the initial centroid of the cluster in which it is located; for each feature vector, judging whether the cluster of the initial centroid nearest to it is the cluster in which it is currently located; and if not, judging whether the variance of the two clusters involved would decrease after the feature vector is exchanged with the feature vector in the minimum-distance cluster that is farthest from that cluster's initial centroid.
After all the feature vectors have been assigned, a plurality of clusters is obtained. Because the number of feature vectors in each cluster cannot exceed the maximum limit value, a cluster may contain feature vectors whose nearest initial centroid, among the initial centroids of all clusters, is not the initial centroid of that cluster. So that each feature vector is, as far as possible, in the cluster of its nearest initial centroid, the electronic device may determine these feature vectors in each cluster.
Furthermore, the electronic device may calculate the variance of the two clusters that would result from exchanging such a feature vector with the feature vector, in the cluster of its nearest initial centroid, that is farthest from that cluster's initial centroid. If the variance does not decrease, the similarity between the feature vectors in the two clusters would not be higher after the exchange, and the exchange is not performed; if the variance decreases, step S304 may be performed.
S304, if so, exchanging the feature vector with the feature vector in the minimum-distance cluster that is farthest from that cluster's initial centroid, until no exchangeable feature vector remains or the number of exchanges reaches the preset maximum number of exchanges, so as to obtain a candidate clustering result.
If the variance of the two clusters decreases after the feature vector that is not at the minimum distance is exchanged with the feature vector, in the cluster of its nearest initial centroid, that is farthest from that cluster's initial centroid, the similarity between the feature vectors in the two clusters is higher after the exchange. The two feature vectors may then be exchanged, so that the feature vectors in the exchanged clusters are distributed more densely, that is, are more similar.
After multiple exchanges, the exchange stops when no exchangeable feature vector remains or the number of exchanges reaches the preset maximum number of exchanges, and the clustering result obtained is a candidate clustering result.
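The exchange procedure of steps S303 and S304 can be sketched as follows. The text does not define how the variance of two clusters is computed, so this sketch assumes it is the sum, over the two clusters, of the mean squared Euclidean distance of each cluster's members to that cluster's mean. The function names and the default of 100 exchanges are hypothetical.

```python
import math


def variance(cluster, vectors):
    """Mean squared distance of a cluster's members to their mean (assumed definition)."""
    points = [vectors[i] for i in cluster]
    mean = [sum(column) / len(points) for column in zip(*points)]
    return sum(math.dist(p, mean) ** 2 for p in points) / len(points)


def refine_by_swapping(vectors, centroids, clusters, max_swaps=100):
    """Exchange misplaced vectors between clusters while the variance decreases.

    clusters: list of non-empty lists of indices into vectors, one per centroid.
    Cluster sizes never change, so the maximum limit value stays respected.
    """
    clusters = [list(c) for c in clusters]
    swaps = 0
    improved = True
    while improved and swaps < max_swaps:
        improved = False
        owner = {i: c for c, members in enumerate(clusters) for i in members}
        # Visit vectors from farthest to nearest relative to their own centroid.
        order = sorted(
            owner,
            key=lambda i: math.dist(vectors[i], centroids[owner[i]]),
            reverse=True,
        )
        for i in order:
            cur = owner[i]
            best = min(
                range(len(centroids)),
                key=lambda c: math.dist(vectors[i], centroids[c]),
            )
            if best == cur:
                continue
            # Partner: member of the nearest cluster farthest from its centroid.
            j = max(clusters[best], key=lambda m: math.dist(vectors[m], centroids[best]))
            before = variance(clusters[cur], vectors) + variance(clusters[best], vectors)
            new_cur = [m for m in clusters[cur] if m != i] + [j]
            new_best = [m for m in clusters[best] if m != j] + [i]
            after = variance(new_cur, vectors) + variance(new_best, vectors)
            if after < before:
                clusters[cur], clusters[best] = new_cur, new_best
                swaps += 1
                improved = True
                break  # memberships changed, so restart the traversal
    return clusters
```

Because an exchange is accepted only when the combined variance strictly decreases, the loop terminates even without the `max_swaps` bound; the bound mirrors the preset maximum number of exchanges in the text.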
S305, determining a plurality of initial centroids for the clustering processing based on the feature vectors of the plurality of internet protocol addresses in the next preset time period, and returning to the step of calculating the distance from all the feature vectors to each initial centroid, where each return to that step yields the candidate clustering result corresponding to the respective preset time period, until there is no next preset time period, so as to obtain a plurality of candidate clustering results.
After a candidate clustering result is obtained, the feature vectors corresponding to the remaining preset time periods may be processed in the same clustering manner to obtain a plurality of candidate clustering results. The electronic device may determine the plurality of initial centroids for the clustering processing based on the feature vectors of the plurality of internet protocol addresses in the next preset time period, where the number of these initial centroids is equal to the number of initial centroids corresponding to the first preset time period.
In some cases, a certain internet protocol address may not be connected to the network in every preset time period; that is, in a certain preset time period there may be no feature vector corresponding to that internet protocol address. The number of feature vectors may therefore differ between preset time periods.
To solve this problem, the electronic device may preset a target number of initial centroids that is not less than the largest number of feature vectors corresponding to any single preset time period. Then, when clustering is performed with the feature vectors corresponding to a certain preset time period as the initial centroids, if the number of initial centroids is smaller than the target number, the missing initial centroids are randomly selected from the feature vectors corresponding to other preset time periods for this round of clustering, so as to ensure that every clustering result contains the same number of clusters. Of course, the missing initial centroids may also reasonably be selected according to a certain rule.
For example, the number of target initial centroids preset by the electronic device is 12, and for a preset time period S, the number of obtained feature vectors is only 10, so that when the feature vectors corresponding to the preset time period S are used as the initial centroids to be processed, 2 feature vectors can be randomly selected from the feature vectors corresponding to other preset time periods, and the 12 feature vectors are used as the initial centroids of the current clustering process.
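The padding of initial centroids described above can be sketched as follows, using random selection without replacement from the feature vectors of the other preset time periods. The function name `pad_initial_centroids` and the optional `rng` parameter (added so the selection can be made reproducible) are hypothetical.

```python
import random


def pad_initial_centroids(period_vectors, other_vectors, target_count, rng=None):
    """Top up the initial centroids of one preset time period to target_count.

    period_vectors: feature vectors of the current preset time period.
    other_vectors: feature vectors of all other preset time periods.
    target_count: preset target number of initial centroids.
    """
    rng = rng or random.Random()
    missing = target_count - len(period_vectors)
    if missing <= 0:
        return list(period_vectors)
    if missing > len(other_vectors):
        raise ValueError("not enough feature vectors in other periods")
    # Randomly pick the missing centroids from the other periods.
    return list(period_vectors) + rng.sample(other_vectors, missing)
```

In the example above, 10 vectors from the preset time period S plus 2 randomly selected vectors give the 12 initial centroids.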
After determining a plurality of initial centroids for clustering, the step of calculating the distance from all feature vectors to each initial centroid may be returned, that is, the steps S302 to S305 may be repeatedly performed until there is no next preset time period, so as to obtain a candidate clustering result corresponding to each preset time period.
Therefore, in the scheme provided by the embodiment of the invention, the electronic equipment can perform clustering processing in the clustering mode to obtain an accurate candidate clustering result corresponding to each preset time period, so that the accuracy of subsequent user identification is improved.
As an implementation manner of the embodiment of the present invention, the step of determining the target clustering result from the plurality of candidate clustering results based on the preset rule may include:
Calculating the average variance of each of the plurality of candidate clustering results; comparing the average variances, and determining the candidate clustering result with the minimum average variance as the target clustering result.
Since the initial centroids selected during the clustering process are different, the obtained multiple candidate clustering results are likely to be different, and in order to more accurately identify the user, the electronic device may determine an optimal clustering result from the multiple candidate clustering results as a final result, i.e., a target clustering result.
Specifically, the variance represents the degree of fluctuation among a plurality of pieces of information, so for the feature vectors, the variance can represent the degree of similarity between a plurality of feature vectors. The electronic device may therefore calculate the average variance of each candidate clustering result, compare the average variances, and select the candidate clustering result with the smallest average variance as the target clustering result. In this way, the similarity between feature vectors in the clusters included in the obtained target clustering result is the highest, that is, the clustering accuracy is the highest.
Therefore, in the scheme provided by the embodiment of the invention, the electronic device can respectively calculate the average variances of a plurality of candidate clustering results, compare the average variances, and further determine the candidate clustering result with the minimum average variance as the target clustering result. Therefore, the electronic equipment can determine the optimal clustering result from the plurality of candidate clustering results as the target clustering result, and the accuracy of user identification is further improved.
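The selection rule can be sketched as follows. The text does not spell out how the average variance is computed, so this sketch assumes that the variance of a cluster is the mean squared Euclidean distance of its feature vectors to the cluster mean, and that the average variance of a clustering result is the mean of these values over its non-empty clusters. The function names are hypothetical.

```python
import math


def cluster_variance(cluster):
    """Mean squared Euclidean distance of the vectors in a cluster to their mean."""
    mean = [sum(column) / len(cluster) for column in zip(*cluster)]
    return sum(math.dist(v, mean) ** 2 for v in cluster) / len(cluster)


def average_variance(clustering):
    """Average of the per-cluster variances, ignoring empty clusters."""
    variances = [cluster_variance(c) for c in clustering if c]
    return sum(variances) / len(variances)


def select_target_clustering(candidates):
    """Return the candidate clustering result with the smallest average variance."""
    return min(candidates, key=average_variance)
```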
As an implementation manner of the embodiment of the present invention, before the step S103, the method may further include:
Carrying out normalization processing on each feature vector to obtain a plurality of processed feature vectors.
Before clustering all the feature vectors, the electronic equipment can normalize each feature vector, so that the calculation difficulty can be reduced, the time required by subsequent processing can be reduced, and the feature vectors can be conveniently clustered.
Correspondingly, the step of performing clustering processing on all the feature vectors based on the plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results may include:
Clustering all the feature vectors based on the processed feature vectors corresponding to each preset time period to obtain a plurality of candidate clustering results.
In this case, the electronic device may perform clustering processing on all the feature vectors based on the processed feature vectors corresponding to each preset time period to obtain a plurality of candidate clustering results.
Therefore, in the scheme provided by the embodiment of the invention, the normalization processing is carried out on the characteristic vectors, so that the difficulty of obtaining the candidate clustering result through calculation is reduced, and the clustering processing efficiency is improved.
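The text does not say which normalization is used. As one possible choice, the sketch below scales each feature vector so that its elements sum to 1, which turns access counts into relative shares per preset domain name; this choice and the function names are assumptions.

```python
def normalize(vector):
    """Scale a count vector so its elements sum to 1 (all-zero vectors stay zero)."""
    total = sum(vector)
    if total == 0:
        return [0.0 for _ in vector]
    return [value / total for value in vector]


def normalize_all(vectors):
    """Normalize every feature vector before clustering."""
    return [normalize(v) for v in vectors]
```

Min-max scaling or scaling to unit Euclidean length would be equally valid alternatives under the wording of the text.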
Corresponding to the method for identifying a user based on the domain name system traffic record data provided by the above embodiment of the present invention, an embodiment of the present invention further provides a device for identifying a user based on the domain name system traffic record data, a schematic structural diagram of which is shown in fig. 4. The device may include:
The information obtaining module 410 is configured to obtain network behavior information corresponding to a plurality of internet protocol addresses in a plurality of preset time periods.
And in each preset time period, the internet protocol address used by each user is unchanged.
The vector construction module 420 is configured to construct a feature vector corresponding to each internet protocol address in each preset time period according to the feature information in the network behavior information, so as to obtain a plurality of feature vectors.
Each feature vector comprises a preset number of dimensions, and each dimension corresponds to a preset domain name.
The vector processing module 430 is configured to perform clustering processing on all the feature vectors based on the plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results.
A result determining module 440, configured to determine a target clustering result from the plurality of candidate clustering results based on a preset rule.
An identifier determining module 450, configured to determine a user identifier corresponding to each cluster in the target clustering result.
The vector determining module 460 is configured to obtain target network behavior information of the user to be identified, and determine a target feature vector corresponding to the target network behavior information.
And the user identification module 470 is configured to determine a target cluster in the target clustering result corresponding to the target feature vector, and determine a user identifier corresponding to the target cluster as a user identifier of a user to be identified.
It can be seen that, in the solution provided in the embodiment of the present invention, the electronic device may obtain network behavior information corresponding to a plurality of internet protocol addresses in a plurality of preset time periods, and construct a feature vector corresponding to each internet protocol address in each preset time period according to the feature information in the network behavior information, so as to obtain a plurality of feature vectors. It may then perform clustering processing on all feature vectors based on the plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results, determine a target clustering result from the plurality of candidate clustering results based on a preset rule, and determine a user identifier corresponding to each cluster in the target clustering result. Finally, it may obtain target network behavior information of a user to be identified, determine a target feature vector corresponding to the target network behavior information, determine a target cluster in the target clustering result corresponding to the target feature vector, and determine the user identifier corresponding to the target cluster as the user identifier of the user to be identified. Within each preset time period, the internet protocol address used by each user is unchanged; each feature vector comprises a preset number of dimensions, and each dimension corresponds to a preset domain name.
The feature vectors can be constructed based on the acquired network behavior information, and can identify the features of the user network behavior information, so that the feature vectors are clustered, the feature vectors with similar rules can be classified into one class, namely, the periodic rules of the user network behavior information are found, and then the user identification corresponding to each cluster can be determined through a target clustering result, and the user can be identified to the greatest extent under the condition that the IP address of the user is continuously changed.
As an implementation manner of the embodiment of the present invention, the vector construction module 420 may include:
An information determining sub-module (not shown in fig. 4), configured to determine, as the feature information, the access information of the preset domain name included in the network behavior information.
A vector construction sub-module (not shown in fig. 4) configured to determine, based on the access information, an access frequency or a number of times of each preset domain name corresponding to each internet protocol address, where the access frequency or the number of times is used as a dimension of a feature vector corresponding to the internet protocol address, and obtain the feature vector corresponding to each internet protocol address.
As an implementation manner of the embodiment of the present invention, the vector processing module 430 may include:
A calculation submodule (not shown in fig. 4), configured to take the plurality of feature vectors corresponding to the first preset time period as the plurality of initial centroids and calculate the distances from all the feature vectors to each of the initial centroids.
A vector dividing submodule (not shown in fig. 4), configured to traverse all the feature vectors and assign each feature vector to the cluster of its nearest initial centroid, considering the initial centroids in ascending order of their distance from the feature vector, and, when the number of feature vectors in the cluster of the nearest initial centroid has reached a maximum limit value, assign the feature vector to the cluster of the next-nearest initial centroid, until all the feature vectors have been assigned, so as to obtain a plurality of clusters.
Wherein the maximum limit value is the number of the preset time periods.
A determining submodule (not shown in fig. 4), configured to traverse, based on the plurality of clusters, the feature vectors in descending order of the distance between each feature vector and the initial centroid of the cluster in which it is currently located, determine, for each feature vector, whether the cluster of the initial centroid nearest to it is the cluster in which it is currently located, and if not, determine whether the variance of the two clusters involved would decrease after the feature vector is exchanged with the feature vector in the minimum-distance cluster that is farthest from that cluster's initial centroid.
A vector processing submodule (not shown in fig. 4), configured to, if so, exchange the feature vector with the feature vector in the minimum-distance cluster that is farthest from that cluster's initial centroid, until no exchangeable feature vector remains or the number of exchanges reaches a preset maximum number of exchanges, so as to obtain a candidate clustering result.
An initial centroid determining submodule (not shown in fig. 4) for determining a plurality of initial centroids of the clustering process based on feature vectors of a plurality of internet protocol addresses of a next one of the preset time periods.
Wherein the number of the plurality of initial centroids is equal to the number of the plurality of initial centroids in the first preset time period.
And a circulation sub-module (not shown in fig. 4) configured to trigger the calculation sub-module to perform calculation on distances from all the feature vectors to each of the initial centroids, where the calculation is performed each time to obtain a candidate clustering result corresponding to a corresponding preset time period, and a plurality of candidate clustering results are obtained until there is no next preset time period.
As an implementation manner of the embodiment of the present invention, the result determining module 440 may include:
a variance calculation submodule (not shown in fig. 4) configured to calculate the average variance of each of the plurality of candidate clustering results.
A result determining submodule (not shown in fig. 4) configured to compare the average variances and determine the candidate clustering result with the smallest average variance as the target clustering result.
As an implementation manner of the embodiment of the present invention, the apparatus may further include:
a normalization processing module (not shown in fig. 4) configured to, before the vector processing module performs clustering on all feature vectors based on a plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results, perform normalization processing on each feature vector to obtain a plurality of processed feature vectors;
the vector processing module 430 may include:
a clustering processing submodule (not shown in fig. 4) configured to perform clustering processing on all the feature vectors based on the plurality of processed feature vectors corresponding to each preset time period, to obtain a plurality of candidate clustering results.
The embodiment of the present invention further provides an electronic device, as shown in fig. 5, including a processor 501, a communication interface 502, a memory 503 and a communication bus 504, where the processor 501, the communication interface 502 and the memory 503 complete mutual communication through the communication bus 504;
a memory 503 for storing a computer program;
the processor 501, when executing the program stored in the memory 503, implements the following steps:
Acquiring network behavior information corresponding to a plurality of internet protocol addresses in a plurality of preset time periods.
Within each preset time period, the internet protocol address used by each user is unchanged.
Constructing, according to the feature information in the network behavior information, a feature vector for each internet protocol address in each preset time period, to obtain a plurality of feature vectors.
Each feature vector comprises a preset number of dimensions, and each dimension corresponds to a preset domain name.
Clustering all the feature vectors based on the plurality of feature vectors obtained in each preset time period, to obtain a plurality of candidate clustering results.
Determining a target clustering result from the plurality of candidate clustering results based on a preset rule.
Determining the user identifier corresponding to each cluster in the target clustering result.
Acquiring target network behavior information of a user to be identified, and determining a target feature vector corresponding to the target network behavior information.
Determining the target cluster in the target clustering result that corresponds to the target feature vector, and determining the user identifier corresponding to the target cluster as the user identifier of the user to be identified.
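The text does not state how the target cluster is matched to the target feature vector. The following is a minimal sketch under one plausible assumption: the target cluster is the one whose centroid (taken here as the mean of its members) is closest to the target feature vector in Euclidean distance. The names `centroid` and `identify_user` and the dictionary layout are hypothetical.

```python
import math

def centroid(members):
    """Mean vector of a cluster's members (assumed definition of the centroid)."""
    n = len(members)
    return [sum(col) / n for col in zip(*members)]

def identify_user(target_vector, clusters):
    """clusters maps a user identifier to the feature vectors in that user's cluster.

    Returns the identifier of the cluster whose centroid is nearest to the
    target feature vector (nearest-centroid matching is an assumption).
    """
    return min(
        clusters,
        key=lambda uid: math.dist(target_vector, centroid(clusters[uid])),
    )
```

For example, calling `identify_user` with a vector that lies close to one cluster's members returns that cluster's user identifier.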
It can be seen that, in the solution provided in the embodiment of the present invention, the electronic device may acquire network behavior information corresponding to a plurality of internet protocol addresses in a plurality of preset time periods, and construct, according to the feature information in the network behavior information, a feature vector for each internet protocol address in each preset time period, to obtain a plurality of feature vectors. It then clusters all the feature vectors based on the plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results, determines a target clustering result from the plurality of candidate clustering results based on a preset rule, and determines the user identifier corresponding to each cluster in the target clustering result. Finally, it acquires target network behavior information of a user to be identified, determines the target feature vector corresponding to that information, determines the target cluster in the target clustering result that corresponds to the target feature vector, and takes the user identifier corresponding to the target cluster as the user identifier of the user to be identified. Within each preset time period, the internet protocol address used by each user is unchanged; each feature vector comprises a preset number of dimensions, and each dimension corresponds to a preset domain name.
Because the feature vectors are constructed from the acquired network behavior information, they characterize the users' network behavior. Clustering the feature vectors groups vectors with similar patterns into one class, that is, it uncovers the periodic patterns in the users' network behavior information. The user identifier corresponding to each cluster can then be determined from the target clustering result, so that users can be identified as far as possible even when their IP addresses change continually.
The communication bus mentioned for the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include a random access memory (RAM) or a non-volatile memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
As an implementation manner of the embodiment of the present invention, the step of constructing a feature vector corresponding to each internet protocol address according to feature information in the network behavior information may include:
Determining the access information for the preset domain names contained in the network behavior information as the feature information.
Determining, based on the access information, the number of accesses or the access frequency of each preset domain name for each internet protocol address, and taking the number of accesses or the access frequency as a dimension of the feature vector corresponding to that internet protocol address, to obtain the feature vector corresponding to each internet protocol address.
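The construction of feature vectors from domain name accesses can be sketched as follows. This is only an illustration: the record shape `(ip, period, domain)`, the example domain list `PRESET_DOMAINS`, and the function name `build_feature_vectors` are assumptions, and the sketch uses raw access counts as the dimension values.

```python
from collections import Counter

# Hypothetical preset domain list; each entry is one dimension of the vector.
PRESET_DOMAINS = ["example.com", "example.org", "example.net"]

def build_feature_vectors(records, preset_domains=PRESET_DOMAINS):
    """Build one feature vector per (IP address, preset time period).

    records: iterable of (ip, period, domain) tuples (assumed DNS log shape).
    Returns {(ip, period): [access count for each preset domain]}.
    """
    counts = {}
    for ip, period, domain in records:
        # Only accesses to preset domains count as feature information.
        if domain in preset_domains:
            counts.setdefault((ip, period), Counter())[domain] += 1
    # Counter returns 0 for domains that were never accessed.
    return {key: [c[d] for d in preset_domains] for key, c in counts.items()}
```

An IP address that never accesses a preset domain in a period yields no vector for that period in this sketch; a real pipeline might emit an all-zero vector instead.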
As an implementation manner of the embodiment of the present invention, the step of performing clustering processing on all feature vectors based on a plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results may include:
Taking a plurality of feature vectors corresponding to the first preset time period as a plurality of initial centroids, and calculating the distances from all the feature vectors to each initial centroid.
Traversing all the feature vectors in ascending order of the distance between each feature vector and each initial centroid and assigning each feature vector to the cluster of its closest initial centroid; when the number of feature vectors in that cluster has reached the maximum limit value, assigning subsequent feature vectors to the cluster of the next closest initial centroid, until all the feature vectors have been assigned, thereby obtaining a plurality of clusters.
The maximum limit value is the number of preset time periods.
Based on the plurality of clusters, traversing the feature vectors in descending order of the distance between each feature vector and the initial centroid of the cluster in which it is located; for each feature vector, judging whether the cluster of the initial centroid at the minimum distance from it is the cluster in which it is currently located and, if not, judging whether the variance of the two clusters involved would be reduced by swapping the feature vector with the feature vector in the minimum-distance cluster that is farthest from that cluster's initial centroid.
If so, swapping the two feature vectors; this continues until no swappable feature vector remains or the number of swaps reaches a preset maximum number of swaps, yielding a candidate clustering result.
Determining a plurality of initial centroids for the clustering process based on the feature vectors of the plurality of internet protocol addresses in the next preset time period.
The number of these initial centroids is equal to the number of initial centroids for the first preset time period.
Returning to the step of calculating the distances from all the feature vectors to each initial centroid; each return to that step yields the candidate clustering result for the corresponding preset time period, and the process repeats until there is no next preset time period, so that a plurality of candidate clustering results is obtained.
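One pass of the size-limited clustering described above (for a single set of initial centroids) might look like the sketch below. It rests on several assumptions that the text does not fix: distance is Euclidean, the variance of a cluster is measured by squared distances to its fixed initial centroid, and the traversal order is recomputed after every swap. The function name `constrained_cluster` and its parameters are hypothetical.

```python
import math

def constrained_cluster(vectors, centroids, max_size, max_swaps=100):
    """Return assign, where assign[i] is the index of the centroid owning vectors[i]."""
    n, k = len(vectors), len(centroids)
    if n > k * max_size:
        raise ValueError("not enough cluster capacity for all vectors")
    d = [[math.dist(v, c) for c in centroids] for v in vectors]

    # Greedy pass: visit (vector, centroid) pairs from the smallest distance upward.
    # A full cluster is skipped, so the vector falls through to its next closest one.
    assign, sizes = [None] * n, [0] * k
    for _, i, j in sorted((d[i][j], i, j) for i in range(n) for j in range(k)):
        if assign[i] is None and sizes[j] < max_size:
            assign[i] = j
            sizes[j] += 1

    # Swap pass: start with the vectors farthest from their own initial centroid.
    swaps, improved = 0, True
    while improved and swaps < max_swaps:
        improved = False
        order = sorted(range(n), key=lambda i: d[i][assign[i]], reverse=True)
        for i in order:
            own = assign[i]
            best = min(range(k), key=lambda j: d[i][j])
            if best == own:
                continue  # already in the cluster of its nearest centroid
            members = [m for m in range(n) if assign[m] == best]
            if not members:
                continue  # nothing to swap with
            far = max(members, key=lambda m: d[m][best])
            # Squared-distance cost of the two affected vectors before and after the swap.
            before = d[i][own] ** 2 + d[far][best] ** 2
            after = d[i][best] ** 2 + d[far][own] ** 2
            if after < before:
                assign[i], assign[far] = best, own
                swaps += 1
                improved = True
                break  # distances to own centroids changed; rebuild the order
    return assign
```

Calling this once per preset time period, with that period's feature vectors as `centroids` and `max_size` set to the number of preset time periods, would give one candidate clustering result per period. Swapping rather than moving vectors keeps every cluster size unchanged, so the size limit set in the greedy pass still holds afterwards.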
As an implementation manner of the embodiment of the present invention, the step of determining the target clustering result from the plurality of candidate clustering results based on the preset rule may include:
respectively calculating the average variance of the plurality of candidate clustering results;
and comparing the average variances, and determining the candidate clustering result with the minimum average variance as a target clustering result.
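Selecting the target clustering result can be sketched as below. The text does not define "average variance" precisely; this sketch assumes it is the mean, over the clusters of one candidate, of each cluster's mean squared distance to its own mean vector. The function names are hypothetical.

```python
def cluster_variance(members):
    """Mean squared distance of a cluster's members to the cluster mean (assumed definition)."""
    n = len(members)
    mean = [sum(col) / n for col in zip(*members)]
    return sum(sum((x - m) ** 2 for x, m in zip(v, mean)) for v in members) / n

def average_variance(clusters):
    """Average of the per-cluster variances for one candidate clustering result."""
    return sum(cluster_variance(c) for c in clusters) / len(clusters)

def pick_target(candidates):
    """Return the candidate clustering result with the smallest average variance."""
    return min(candidates, key=average_variance)
```

Here each candidate is a list of clusters and each cluster is a non-empty list of feature vectors.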
As an implementation manner of the embodiment of the present invention, before the step of performing clustering processing on all feature vectors based on a plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results, the method may further include:
normalizing each feature vector to obtain a plurality of processed feature vectors;
the step of performing clustering processing on all the feature vectors based on the plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results may include:
and clustering all the feature vectors based on the processed feature vectors corresponding to each preset time period to obtain a plurality of candidate clustering results.
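The text does not say which normalization is applied. As one possible choice, the sketch below scales each feature vector to unit Euclidean length so that heavy and light users with the same access pattern become comparable; L2 normalization and the name `normalize` are assumptions.

```python
import math

def normalize(vec):
    """Scale a feature vector to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)
```

Other schemes, such as dividing by the total number of accesses or min-max scaling per dimension, would fit the same step.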
In a further embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the method for identifying a user based on domain name system traffic record data according to any of the above embodiments.
It can be seen that, in the solution provided in the embodiment of the present invention, the electronic device may acquire network behavior information corresponding to a plurality of internet protocol addresses in a plurality of preset time periods, and construct, according to the feature information in the network behavior information, a feature vector for each internet protocol address in each preset time period, to obtain a plurality of feature vectors. It then clusters all the feature vectors based on the plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results, determines a target clustering result from the plurality of candidate clustering results based on a preset rule, and determines the user identifier corresponding to each cluster in the target clustering result. Finally, it acquires target network behavior information of a user to be identified, determines the target feature vector corresponding to that information, determines the target cluster in the target clustering result that corresponds to the target feature vector, and takes the user identifier corresponding to the target cluster as the user identifier of the user to be identified. Within each preset time period, the internet protocol address used by each user is unchanged; each feature vector comprises a preset number of dimensions, and each dimension corresponds to a preset domain name.
Because the feature vectors are constructed from the acquired network behavior information, they characterize the users' network behavior. Clustering the feature vectors groups vectors with similar patterns into one class, that is, it uncovers the periodic patterns in the users' network behavior information. The user identifier corresponding to each cluster can then be determined from the target clustering result, so that users can be identified as far as possible even when their IP addresses change continually.
The above embodiments may be implemented wholly or partially in software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
It is noted that, herein, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in an interrelated manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the embodiments of the apparatus, the electronic device, and the computer-readable storage medium are substantially similar to the method embodiments, they are described relatively briefly, and the relevant parts of the description of the method embodiments may be consulted for details.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for identifying a user based on domain name system traffic record data, comprising:
acquiring network behavior information corresponding to a plurality of internet protocol addresses in a plurality of preset time periods, wherein the internet protocol address used by each user is unchanged in each preset time period;
according to the feature information in the network behavior information, constructing a corresponding feature vector of each Internet protocol address in each preset time period to obtain a plurality of feature vectors, wherein each feature vector comprises a preset number of dimensions, and each dimension corresponds to a preset domain name;
clustering all the feature vectors based on the plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results;
determining a target clustering result from the plurality of candidate clustering results based on a preset rule;
determining a user identifier corresponding to each cluster in the target clustering result;
acquiring target network behavior information of a user to be identified, and determining a target feature vector corresponding to the target network behavior information;
and determining a target cluster in the target clustering result corresponding to the target feature vector, and determining the user identifier corresponding to the target cluster as the user identifier of the user to be identified.
2. The method according to claim 1, wherein the step of constructing a feature vector corresponding to each internet protocol address according to the feature information in the network behavior information comprises:
determining the access information for the preset domain names contained in the network behavior information as the feature information;
and determining, based on the access information, the number of accesses or the access frequency of each preset domain name for each internet protocol address, and taking the number of accesses or the access frequency as a dimension of the feature vector corresponding to that internet protocol address, to obtain the feature vector corresponding to each internet protocol address.
3. The method according to claim 1, wherein the step of clustering all the feature vectors to obtain a plurality of candidate clustering results based on a plurality of feature vectors obtained in each preset time period comprises:
taking a plurality of feature vectors corresponding to the first preset time period as a plurality of initial centroids, and calculating the distance from all the feature vectors to each initial centroid;
traversing all the feature vectors in ascending order of the distance between each feature vector and each initial centroid and assigning each feature vector to the cluster of its closest initial centroid, and, when the number of feature vectors in that cluster has reached a maximum limit value, assigning subsequent feature vectors to the cluster of the next closest initial centroid, until all the feature vectors have been assigned, to obtain a plurality of clusters, wherein the maximum limit value is the number of preset time periods;
traversing, based on the plurality of clusters, the feature vectors in descending order of the distance between each feature vector and the initial centroid of the cluster in which it is located; judging, for each feature vector, whether the cluster of the initial centroid at the minimum distance from it is the cluster in which it is currently located; and, if not, judging whether the variance of the two clusters involved would be reduced by swapping the feature vector with the feature vector in the minimum-distance cluster that is farthest from that cluster's initial centroid;
if so, swapping the feature vector with the feature vector in the minimum-distance cluster that is farthest from that cluster's initial centroid, until no swappable feature vector remains or the number of swaps reaches a preset maximum number of swaps, to obtain a candidate clustering result;
determining a plurality of initial centroids for the clustering processing based on the feature vectors of the plurality of internet protocol addresses in the next preset time period, wherein the number of these initial centroids is equal to the number of initial centroids corresponding to the first preset time period;
and returning to the step of calculating the distances from all the feature vectors to each initial centroid, wherein each return to that step yields the candidate clustering result for the corresponding preset time period, until there is no next preset time period, to obtain a plurality of candidate clustering results.
4. The method according to claim 1, wherein the step of determining the target clustering result from the plurality of candidate clustering results based on a preset rule comprises:
respectively calculating the average variance of the plurality of candidate clustering results;
and comparing the average variances, and determining the candidate clustering result with the minimum average variance as a target clustering result.
5. The method according to claim 1, wherein before the step of clustering all the feature vectors based on the feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results, the method further comprises:
normalizing each feature vector to obtain a plurality of processed feature vectors;
the step of clustering all the feature vectors based on the plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results comprises:
and based on the processed plurality of feature vectors corresponding to each preset time period, performing clustering processing on all the feature vectors obtained in all the preset time periods to obtain a plurality of candidate clustering results.
6. An apparatus for identifying a user based on domain name system traffic record data, comprising:
the information acquisition module is used for acquiring network behavior information corresponding to a plurality of internet protocol addresses in a plurality of preset time periods, wherein the internet protocol address used by each user is unchanged in each preset time period;
the vector construction module is used for constructing a corresponding feature vector of each internet protocol address in each preset time period according to feature information in the network behavior information to obtain a plurality of feature vectors, wherein each feature vector comprises a preset number of dimensions, and each dimension corresponds to a preset domain name;
the vector processing module is used for clustering all the feature vectors based on the plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results;
the result determining module is used for determining a target clustering result from the plurality of candidate clustering results based on a preset rule;
the identifier determining module is used for determining a user identifier corresponding to each cluster in the target clustering result;
the vector determination module is used for acquiring target network behavior information of a user to be identified and determining a target feature vector corresponding to the target network behavior information;
and the user identification module is used for determining a target cluster in the target clustering result corresponding to the target feature vector and determining the user identifier corresponding to the target cluster as the user identifier of the user to be identified.
7. The apparatus of claim 6, wherein the vector construction module comprises:
the information determining submodule is used for determining the access information for the preset domain names contained in the network behavior information as the feature information;
and the vector construction submodule is used for determining, based on the access information, the number of accesses or the access frequency of each preset domain name for each internet protocol address, and taking the number of accesses or the access frequency as a dimension of the feature vector corresponding to that internet protocol address, to obtain the feature vector corresponding to each internet protocol address.
8. The apparatus of claim 6, wherein the vector processing module comprises:
the calculation submodule is used for taking the plurality of feature vectors corresponding to the first preset time period as a plurality of initial centroids and calculating the distances from all the feature vectors to each initial centroid;
the vector dividing submodule is used for traversing all the feature vectors in ascending order of the distance between each feature vector and each initial centroid and assigning each feature vector to the cluster of its closest initial centroid, and, when the number of feature vectors in that cluster has reached a maximum limit value, assigning subsequent feature vectors to the cluster of the next closest initial centroid, until all the feature vectors have been assigned, to obtain a plurality of clusters, wherein the maximum limit value is the number of preset time periods;
the judging submodule is used for traversing, based on the plurality of clusters, the feature vectors in descending order of the distance between each feature vector and the initial centroid of the cluster in which it is located; judging, for each feature vector, whether the cluster of the initial centroid at the minimum distance from it is the cluster in which it is currently located; and, if not, judging whether the variance of the two clusters involved would be reduced by swapping the feature vector with the feature vector in the minimum-distance cluster that is farthest from that cluster's initial centroid;
the vector processing submodule is used for, if the variance would be reduced, swapping the feature vector with the feature vector in the minimum-distance cluster that is farthest from that cluster's initial centroid, until no swappable feature vector remains or the number of swaps reaches a preset maximum number of swaps, to obtain a candidate clustering result;
the initial centroid determining submodule is used for determining a plurality of initial centroids for the clustering processing based on the feature vectors of the plurality of internet protocol addresses in the next preset time period, wherein the number of these initial centroids is equal to the number of initial centroids for the first preset time period;
and the circulation submodule is used for triggering the calculation submodule to calculate the distances from all the feature vectors to each initial centroid, wherein each such calculation pass yields the candidate clustering result for the corresponding preset time period, until there is no next preset time period, to obtain a plurality of candidate clustering results.
9. The apparatus of claim 6, wherein the result determination module comprises:
the variance calculation submodule is used for calculating the average variance of the candidate clustering results respectively;
and the result determining submodule is used for comparing the average variances and determining the candidate clustering result with the minimum average variance as the target clustering result.
10. The apparatus of claim 6, further comprising:
the normalization processing module is used for performing normalization processing on each feature vector to obtain a plurality of processed feature vectors before the vector processing module performs clustering processing on all feature vectors based on a plurality of feature vectors obtained in each preset time period to obtain a plurality of candidate clustering results;
the vector processing module comprises:
and the clustering processing submodule is used for clustering all the feature vectors based on the plurality of processed feature vectors corresponding to each preset time period to obtain a plurality of candidate clustering results.
CN201911053713.3A 2019-10-31 2019-10-31 Method and device for identifying user based on domain name system flow record data Active CN110807487B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911053713.3A CN110807487B (en) 2019-10-31 2019-10-31 Method and device for identifying user based on domain name system flow record data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911053713.3A CN110807487B (en) 2019-10-31 2019-10-31 Method and device for identifying user based on domain name system flow record data

Publications (2)

Publication Number Publication Date
CN110807487A true CN110807487A (en) 2020-02-18
CN110807487B CN110807487B (en) 2023-01-17

Family

ID=69489828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911053713.3A Active CN110807487B (en) 2019-10-31 2019-10-31 Method and device for identifying user based on domain name system flow record data

Country Status (1)

Country Link
CN (1) CN110807487B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105373614A (en) * 2015-11-24 2016-03-02 中国科学院深圳先进技术研究院 Sub-user identification method and system based on user account
US20160269361A1 (en) * 2013-11-01 2016-09-15 Beijing Qihoo Technology Company Limited Method and device for recognizing an ip address of a specified category, a defense method and system
CN109885684A (en) * 2019-01-31 2019-06-14 腾讯科技(深圳)有限公司 One type cluster processing method and processing device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Matthias Kirchler et al.: "Tracked Without a Trace: Linking Sessions of Users by Unsupervised Learning of Patterns in Their DNS Traffic", 《AISec '16: Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111405080A (en) * 2020-03-09 2020-07-10 北京冠程科技有限公司 Terminal IP management system and user behavior auditing method based on same
CN112699955A (en) * 2021-01-08 2021-04-23 广州新科佳都科技有限公司 User classification method, device, equipment and storage medium
CN113014573A (en) * 2021-02-23 2021-06-22 杭州安恒信息技术股份有限公司 Monitoring method, system, electronic device and storage medium of DNS (Domain name Server)
CN114386502A (en) * 2022-01-07 2022-04-22 北京点众科技股份有限公司 Method, apparatus and storage medium for cluster analysis of fast-applying users
CN114638316A (en) * 2022-03-30 2022-06-17 大唐融合通信股份有限公司 Data clustering method, device and equipment
CN114679386A (en) * 2022-05-25 2022-06-28 杭州海康威视数字技术股份有限公司 Cloud-edge cooperative Internet of things device role judgment and management method, system and device
CN114679386B (en) * 2022-05-25 2022-08-05 杭州海康威视数字技术股份有限公司 Cloud-edge cooperative Internet of things device role judgment and management method, system and device
CN115603947A (en) * 2022-09-15 2023-01-13 北京百度网讯科技有限公司 Abnormal access detection method and device

Also Published As

Publication number Publication date
CN110807487B (en) 2023-01-17

Similar Documents

Publication Publication Date Title
CN110807487B (en) Method and device for identifying user based on domain name system flow record data
CN110162695B (en) Information pushing method and equipment
US10621493B2 (en) Multiple record linkage algorithm selector
CN110460587B (en) Abnormal account detection method and device and computer storage medium
US20160065534A1 (en) System for correlation of domain names
WO2020037105A1 (en) Identification and application of hyperparameters for machine learning
EP3396558B1 (en) Method for user identifier processing, terminal and nonvolatile computer readable storage medium thereof
CN111062013B (en) Account filtering method and device, electronic equipment and machine-readable storage medium
CN110083475B (en) Abnormal data detection method and device
US20140101124A1 (en) System and method for recursively traversing the internet and other sources to identify, gather, curate, adjudicate, and qualify business identity and related data
CN108366012B (en) Social relationship establishing method and device and electronic equipment
CN110830445B (en) Method and device for identifying abnormal access object
CN109190014B (en) Regular expression generation method and device and electronic equipment
US11409770B2 (en) Multi-distance similarity analysis with tri-point arbitration
CN109450969B (en) Method and device for acquiring data from third-party data source server and server
US20230205755A1 (en) Methods and systems for improved search for data loss prevention
CN110213255B (en) Method and device for detecting Trojan horse of host and electronic equipment
CN108804550B (en) Query term expansion method and device and electronic equipment
CN111209929A (en) Access data processing method and device, computer equipment and storage medium
CN108154024B (en) Data retrieval method and device and electronic equipment
CN113315851A (en) Domain name detection method, device and storage medium
CN114297037A (en) Alarm clustering method and device
CN113282831A (en) Search information recommendation method and device, electronic equipment and storage medium
CN110120918B (en) Identification analysis method and device
CN113127693A (en) Traffic data packet statistical method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant