CN105574539A - DNS log analysis method and apparatus - Google Patents


Info

Publication number
CN105574539A
Authority
CN
China
Prior art keywords
vector
information
matrix
vectorial
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510920374.XA
Other languages
Chinese (zh)
Other versions
CN105574539B (en)
Inventor
刘千仞
周光涛
孙莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd
Priority to CN201510920374.XA
Publication of CN105574539A
Application granted
Publication of CN105574539B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Abstract

Embodiments of the invention provide a DNS log analysis method and apparatus that at least address the prior-art problem that a reasonable K value cannot be determined simply and effectively during DNS log analysis. The method comprises: obtaining a DNS log and preprocessing it to obtain preprocessed log text, where the preprocessed log text contains at least one piece of text information and each piece of text information contains first information corresponding to that text information; extracting the first information contained in each piece of text information from the preprocessed log text and constructing a feature vector matrix of the first information; determining, from the feature vector matrix of the first information, the K value to use when K-means clustering is performed on the first information contained in each piece of text information; and performing K-means clustering on that first information according to the K value to obtain a clustering result. The DNS log analysis method and apparatus are applicable to the field of Internet technology.

Description

DNS log analysis method and apparatus
Technical field
The present invention relates to the field of Internet technology, and in particular to a domain name system (DNS) log analysis method and apparatus.
Background
As the first entry point to the Internet, DNS assigns domain name addresses and Internet Protocol (IP) addresses to hosts on the Internet, and no Internet architecture can do without it. Research based on DNS has therefore received growing attention. Internet companies and operators have carried out in-depth DNS research, and DNS analysis on big data platforms has become a key research direction.
DNS access logs contain rich information and are highly valuable for data mining. A traditional DNS log analysis flow is as follows: first, the DNS access log is obtained with a logging tool and saved; second, the log file is processed to extract useful data; finally, the data are analysed and conclusions are drawn. Log files can be processed in many ways, and one effective approach is clustering. A clustering algorithm groups a series of documents into multiple clusters, with the goal that documents within a cluster are as similar as possible and documents in different clusters are as dissimilar as possible. K-means is an important clustering algorithm. It is fast and its results are intuitive and easy to understand, but the distribution of the initial cluster centres strongly affects the clustering result. In addition, the number of clusters K is usually an input parameter of the algorithm, and a reasonable value of K is often hard to infer. Some algorithms for determining K exist, but they are computationally complex, require repeated clustering runs or prior knowledge to find a reasonable K, and do not perform noticeably well on DNS log files.
Therefore, how to determine a reasonable K value simply and effectively during DNS log analysis, and thereby improve the clustering result, is a problem that urgently needs to be solved.
Summary of the invention
Embodiments of the invention provide a DNS log analysis method and apparatus that at least address the prior-art problem that a reasonable K value cannot be determined simply and effectively during DNS log analysis.
To achieve the above object, embodiments of the invention adopt the following technical solutions:
In a first aspect, a domain name system (DNS) log analysis method is provided, the method comprising:
obtaining a DNS log and preprocessing the DNS log to obtain preprocessed log text, where the preprocessed log text contains at least one piece of text information and each piece of text information contains first information corresponding to that text information;
extracting, from the preprocessed log text, the first information contained in each piece of text information, and constructing a feature vector matrix of the first information;
determining, according to the feature vector matrix of the first information, the K value to use when K-means clustering is performed on the first information contained in each piece of text information;
performing, according to the K value, K-means clustering on the first information contained in each piece of text information to obtain a clustering result.
With the DNS log analysis method provided by the embodiments of the invention, once the first information has been extracted, a feature vector matrix of the first information can be constructed and the K value for K-means clustering of the first information can be determined from that matrix. The computation is therefore simple and effective, which improves the clustering result.
In a second aspect, a domain name system (DNS) log analysis apparatus is provided, the DNS log analysis apparatus comprising an acquiring unit, a construction unit, a determining unit and a clustering unit;
the acquiring unit is configured to obtain a DNS log and preprocess the DNS log to obtain preprocessed log text, where the preprocessed log text contains at least one piece of text information and each piece of text information contains first information corresponding to that text information;
the construction unit is configured to extract, from the preprocessed log text, the first information contained in each piece of text information, and to construct a feature vector matrix of the first information;
the determining unit is configured to determine, according to the feature vector matrix of the first information, the K value to use when K-means clustering is performed on the first information contained in each piece of text information;
the clustering unit is configured to perform, according to the K value, K-means clustering on the first information contained in each piece of text information to obtain a clustering result.
With the DNS log analysis apparatus provided by the embodiments of the invention, once the first information has been extracted, a feature vector matrix of the first information can be constructed and the K value for K-means clustering of the first information can be determined from that matrix. The computation is therefore simple and effective, which improves the clustering result.
Brief description of the drawings
Fig. 1 is a first schematic flowchart of the DNS log analysis method provided by an embodiment of the invention;
Fig. 2 is a second schematic flowchart of the DNS log analysis method provided by an embodiment of the invention;
Fig. 3 is an algorithm flowchart of the process of determining the K value provided by an embodiment of the invention;
Fig. 4 is a third schematic flowchart of the DNS log analysis method provided by an embodiment of the invention;
Fig. 5 is a schematic structural diagram of the DNS log analysis apparatus provided by an embodiment of the invention.
Detailed description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
To describe the technical solutions of the embodiments clearly, terms such as "first" and "second" are used to distinguish identical or similar items whose functions and effects are substantially the same. A person skilled in the art will understand that terms such as "first" and "second" do not limit quantity or order of execution.
In addition, in the embodiments of the invention, words such as "example" and "for example" are used to indicate an example, illustration or explanation. Any embodiment or design described as an "example" or introduced with "for example" should not be interpreted as preferred over, or more advantageous than, other embodiments or designs. Rather, such words are intended to present a concept in a concrete manner.
Embodiment 1
An embodiment of the invention provides a DNS log analysis method which, as shown in Fig. 1, comprises steps S101 to S104:
S101: the DNS log analysis apparatus obtains a DNS log and preprocesses it to obtain preprocessed log text.
The preprocessed log text contains at least one piece of text information, and each piece of text information contains first information corresponding to that text information.
S102: the DNS log analysis apparatus extracts, from the preprocessed log text, the first information contained in each piece of text information, and constructs a feature vector matrix of the first information.
S103: the DNS log analysis apparatus determines, according to the feature vector matrix of the first information, the K value to use when K-means clustering is performed on the first information contained in each piece of text information.
S104: the DNS log analysis apparatus performs K-means clustering, according to the K value, on the first information contained in each piece of text information to obtain a clustering result.
Specifically, in step S101 of the embodiment of the invention:
After obtaining the DNS log, the DNS log analysis apparatus can preprocess it. The preprocessing may include, but is not limited to, extracting the useful information from the DNS log by splitting on spaces, punctuation or special symbols, and building a new, simplified log document from it. The useful information may be, for example, source IP address information, domain name information or time information; the embodiment of the invention places no specific limitation on this.
It should be noted that after preprocessing, items such as the source IP address and the domain name in the DNS log already count as complete terms, so no further word segmentation is needed and they can be used directly.
In the embodiment of the invention, the first information may be, for example, source IP address information or domain name information; the embodiment of the invention places no specific limitation on this.
The source IP address information may include one or more of the following for a specified time period: the total number of queries, the number of domain names queried, the query time interval and the number of repeated queries. The domain name information may include one or more of the following for a specified time period: the total number of times the domain name was queried, the number of querying IP addresses, the query time interval and the number of repeated queries. The embodiment of the invention places no specific limitation on this.
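As an illustration of how per-source-IP features of this kind might be derived from preprocessed log text, the following is a minimal sketch. The log line layout (`<timestamp> <source_ip> <domain>`), the sample data, the function names, and the exact definitions of "query time interval" (window length divided by total queries) and "repeated queries" (total minus distinct domains) are all assumptions made for the example; the source does not specify them.

```python
from collections import defaultdict

# Hypothetical preprocessed log layout (an assumption, not from the source):
# "<unix_timestamp> <source_ip> <domain>", separated by whitespace.
SAMPLE_LOG = """\
0 10.0.0.1 example.com
10 10.0.0.1 example.com
20 10.0.0.1 example.org
5 10.0.0.2 example.net
"""


def parse_line(line):
    """Split one preprocessed line into (timestamp, source_ip, domain)."""
    timestamp, source_ip, domain = line.split()
    return float(timestamp), source_ip, domain


def source_ip_features(log_text, window_seconds):
    """Return {ip: [total queries, distinct domains, mean interval, repeats]}.

    The last two definitions are illustrative assumptions:
    mean interval = window_seconds / total queries,
    repeats       = total queries - distinct domains.
    """
    domains_by_ip = defaultdict(list)
    for line in log_text.splitlines():
        if line.strip():
            _, source_ip, domain = parse_line(line)
            domains_by_ip[source_ip].append(domain)

    features = {}
    for source_ip, domains in domains_by_ip.items():
        total = len(domains)
        distinct = len(set(domains))
        features[source_ip] = [total, distinct, window_seconds / total, total - distinct]
    return features
```

Each value in the returned dictionary is one candidate row of the feature vector matrix built in step S102.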
Specifically, in step S102 of the embodiment of the invention:
Lucene is a flexible information retrieval library that can easily be embedded in applications to provide full-text indexing and search. In the embodiment of the invention, Lucene can therefore be used to build an index over the first information in the preprocessed log text. For example, as shown in Table 1, three indexes are built for the source IP address, the access time and the domain name respectively. Indexing the data in this way makes it easier to query and manage, and also makes it easier to extract these items from the log text.
Table 1
Here the storage mode "Field.Store.YES" means that the field value is stored, and the index mode "Field.Index.NOT_ANALYZED" means that the field is indexed without tokenization.
The source IP address field in Table 1 represents the IP address of the accessing user. It is indexed without tokenization and stored, which supports IP-based queries and allows the accessing user's IP to be displayed when a domain name is queried. The access time field in Table 1 represents the date and time at which the user accessed the domain name. It is indexed without tokenization and stored, which supports queries by date and time and allows the query time to be displayed in query results. The domain name field represents the domain name accessed by the user. It is indexed without tokenization and stored, which supports domain-name-based queries and allows the domain name to be displayed in query results.
Specifically, in step S103 of the embodiment of the invention:
In one possible implementation, as shown in Fig. 2, the step in which the DNS log analysis apparatus determines, according to the feature vector matrix of the first information, the K value to use when K-means clustering is performed on the first information contained in each piece of text information (step S103) may comprise steps S103a to S103c:
S103a: the DNS log analysis apparatus randomly selects a vector B from the feature vector matrix A of the first information and adds B to an empty set C, obtaining set C1 and vector matrix A1.
Set C1 contains vector B, and vector matrix A1 is the feature vector matrix A with vector B removed.
S103b: starting from m=1, n=1, the DNS log analysis apparatus repeats steps S1 to S3 until vector matrix A(m+1) is empty, where m and n are positive integers not less than 1:
S1: randomly select a vector Dm from vector matrix Am, and determine the similarity between Dm and each vector in set Cn.
S2: if the similarity between vector Dm and every vector in set Cn is less than a preset threshold, add Dm to set Cn, obtaining set C(n+1) and vector matrix A(m+1).
Set C(n+1) contains vector Dm, and vector matrix A(m+1) is vector matrix Am with vector Dm removed.
S3: if the similarity between vector Dm and some vector in set Cn is not less than the preset threshold, obtain vector matrix A(m+1); that is, Dm is removed from Am and set Cn is left unchanged.
S103c: the DNS log analysis apparatus takes the number of vectors in set C(n+1) at the moment vector matrix A(m+1) becomes empty as the K value to use when K-means clustering is performed on the first information contained in each piece of text information.
The algorithm flow for determining this K value may be as shown in Fig. 3. A vector B is drawn at random from vector matrix A and added to set C. A vector D is then drawn at random from the remaining vector matrix A-B, and the similarity between D and each vector in set C is computed. If the similarity between D and every vector in C is less than the preset threshold, D is added to C; otherwise another vector is drawn from A-B. This is repeated until the number of vectors in C no longer increases, that is, until A-B is empty. The number of vectors finally in set C is taken as the final K value.
It should be noted that the preset threshold in the embodiment of the invention may be an empirical value or a preferred value obtained through repeated experiments; the embodiment of the invention places no specific limitation on this.
Preferably, step S1 may be implemented as follows:
Randomly select a vector Dm from vector matrix Am, and determine the similarity between Dm and each vector in set Cn according to formula (1):
$$\mathrm{sim}(X, Y) = \cos\theta = \frac{X \cdot Y}{\|X\| \times \|Y\|} \tag{1}$$
where sim(X, Y) denotes the similarity between X and Y, X denotes vector Dm, Y denotes a vector in set Cn, and ||*|| denotes the norm (modulus) of *.
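The selection procedure of steps S103a to S103c, with formula (1) as the similarity measure, can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the use of a shuffled list to model random selection without replacement, and the optional `rng` parameter are assumptions made for the example, and the vectors are assumed to be non-zero so that formula (1) is defined.

```python
import math
import random


def cosine_similarity(x, y):
    """Formula (1): cos(theta) = (X . Y) / (||X|| * ||Y||); assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def determine_k(feature_matrix, threshold, rng=None):
    """Return (K, selected vectors) following steps S103a-S103c."""
    if not feature_matrix:
        raise ValueError("feature matrix must contain at least one vector")
    rng = rng or random.Random()

    # Shuffling once and popping is equivalent to repeated random
    # selection without replacement from the remaining matrix.
    remaining = list(feature_matrix)
    rng.shuffle(remaining)

    selected = [remaining.pop()]          # S103a: vector B goes into set C
    while remaining:                      # S103b: loop until the matrix is empty
        candidate = remaining.pop()       # S1: pick Dm
        if all(cosine_similarity(candidate, c) < threshold for c in selected):
            selected.append(candidate)    # S2: dissimilar to every member, so add it
        # S3: otherwise Dm is simply dropped and the set is unchanged
    return len(selected), selected        # S103c: K is the size of the set
```

Because vectors are drawn at random, the set of selected vectors (and, for some data, K itself) can depend on the draw order; passing a seeded `random.Random` makes a run repeatable.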
Specifically, in step S104 of the embodiment of the invention:
In one possible implementation, as shown in Fig. 4, the step in which the DNS log analysis apparatus performs K-means clustering, according to the K value, on the first information contained in each piece of text information to obtain a clustering result (step S104) may comprise steps S104a to S104c:
S104a: the DNS log analysis apparatus selects K vectors from the feature vector matrix A of the first information as the initial centre points.
S104b: the DNS log analysis apparatus computes the distance from each vector in the feature vector matrix A to each of the K vectors.
S104c: the DNS log analysis apparatus assigns every vector in vector matrix A to its nearest centre (minimum-distance principle), obtaining K classes.
It should be noted that steps S104a to S104c describe the K-means algorithm and show only the result of the first iteration. In general, K-means iterates many times: after the first assignment is computed, the centre of each class is recomputed and the vectors are reassigned, until the class centres no longer change. The embodiment of the invention does not describe this further; existing implementations may be consulted.
Preferably, step S104b may be implemented as follows:
The DNS log analysis apparatus computes the distance from each vector in the feature vector matrix A to each of the K vectors according to formula (2):
$$\mathrm{dist}(X, Y) = \left( \sum_{i=1}^{N} (x_i - y_i)^2 \right)^{1/2} \tag{2}$$
where dist(X, Y) denotes the distance between X and Y, X denotes a vector in vector matrix A, Y denotes one of the K vectors, $\sum_{i=1}^{N}$ denotes summation over i from 1 to N (N being the number of elements in each vector), $x_i$ denotes the i-th element of vector X, and $y_i$ denotes the i-th element of vector Y.
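Steps S104a to S104c, together with the usual K-means iteration and the Euclidean distance of formula (2), can be sketched as follows. This is a plain textbook K-means written for illustration; the function names, the `max_iter` cap, the random choice of initial centres, and the handling of an empty cluster (its centre is kept) are assumptions, since the source defers these details to existing implementations.

```python
import math
import random


def euclidean_distance(x, y):
    """Formula (2): square root of the summed squared element differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def assign(vectors, centers):
    """S104b-S104c: index of the nearest centre for every vector."""
    return [
        min(range(len(centers)), key=lambda j: euclidean_distance(v, centers[j]))
        for v in vectors
    ]


def k_means(vectors, k, max_iter=100, rng=None):
    """Return (labels, centers) after iterating until the centres stop changing."""
    rng = rng or random.Random()
    centers = [list(v) for v in rng.sample(vectors, k)]   # S104a: K initial centres
    labels = assign(vectors, centers)
    for _ in range(max_iter):
        new_centers = []
        for j in range(k):
            members = [v for v, label in zip(vectors, labels) if label == j]
            if members:
                # New centre is the element-wise mean of the cluster members.
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])            # keep an empty cluster's centre
        if new_centers == centers:                        # centres unchanged: converged
            break
        centers = new_centers
        labels = assign(vectors, centers)
    return labels, centers
```

For example, `k_means([[0, 0], [0, 1], [10, 10], [10, 11]], 2)` separates the two points near the origin from the two points near (10, 10); which group receives label 0 depends on the random initial centres.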
The DNS log analysis method above is described below with an example in which the first information is the source IP address. Suppose the source IP address information contained in each piece of text information extracted from the preprocessed log text is as shown in Table 2.
Table 2
First, the feature vector matrix of the source IP addresses can be built from Table 2:
$$A = \begin{pmatrix} 500 & 100 & 172.8 & 20 \\ 40 & 10 & 2160 & 6 \\ 400 & 11 & 216 & 18 \\ 80 & 15 & 1080 & 8 \\ 111500 & 1 & 0.77 & 111500 \end{pmatrix}$$
Second, the K value to use when K-means clustering is performed on the source IP addresses contained in each piece of text information is determined from the feature vector matrix A, as follows:
First, a vector B = {500, 100, 172.8, 20} is randomly selected from A and added to the empty set C, giving set C1 = {500, 100, 172.8, 20}. Matrix A then becomes
$$A_1 = \begin{pmatrix} 40 & 10 & 2160 & 6 \\ 400 & 11 & 216 & 18 \\ 80 & 15 & 1080 & 8 \\ 111500 & 1 & 0.77 & 111500 \end{pmatrix}$$
Second, a vector D1 = {80, 15, 1080, 8} is randomly selected from A1 and compared with every vector in set C1. According to formula (1):
$$\mathrm{sim}(D_1, B) = \cos\theta = \frac{D_1 \cdot B}{\|D_1\| \times \|B\|} = \frac{\{80, 15, 1080, 8\} \cdot \{500, 100, 172.8, 20\}}{\|\{80, 15, 1080, 8\}\| \times \|\{500, 100, 172.8, 20\}\|} = 0.39$$
Suppose the preset threshold is a = 0.5. Since 0.39 < 0.5, D1 and B can be placed in two different classes, so D1 is added to set C1. Set C1 becomes
$$C_2 = \begin{pmatrix} 500 & 100 & 172.8 & 20 \\ 80 & 15 & 1080 & 8 \end{pmatrix}$$
and matrix A1 becomes
$$A_2 = \begin{pmatrix} 40 & 10 & 2160 & 6 \\ 400 & 11 & 216 & 18 \\ 111500 & 1 & 0.77 & 111500 \end{pmatrix}$$
Third, a vector D2 = {40, 10, 2160, 6} is randomly selected from A2 and compared with every vector in set C2. According to formula (1):
$$\mathrm{sim}(D_2, B) = \cos\theta = \frac{D_2 \cdot B}{\|D_2\| \times \|B\|} = \frac{\{40, 10, 2160, 6\} \cdot \{500, 100, 172.8, 20\}}{\|\{40, 10, 2160, 6\}\| \times \|\{500, 100, 172.8, 20\}\|} = 0.33$$
$$\mathrm{sim}(D_2, D_1) = \cos\theta = \frac{D_2 \cdot D_1}{\|D_2\| \times \|D_1\|} = \frac{\{40, 10, 2160, 6\} \cdot \{80, 15, 1080, 8\}}{\|\{40, 10, 2160, 6\}\| \times \|\{80, 15, 1080, 8\}\|} = 0.99$$
Since 0.33 < 0.5, D2 and B can be placed in two different classes. However, since 0.99 > 0.5, D2 and D1 belong to the same class, so the condition that the similarity between D2 and every vector in set C2 is less than the preset threshold is not met, and D2 is not added to set C2. Set C2 is unchanged, and matrix A2 becomes
$$A_3 = \begin{pmatrix} 400 & 11 & 216 & 18 \\ 111500 & 1 & 0.77 & 111500 \end{pmatrix}$$
Fourth, a vector D3 = {400, 11, 216, 18} is randomly selected from A3 and compared with every vector in set C2. According to formula (1):
$$\mathrm{sim}(D_3, B) = \cos\theta = \frac{D_3 \cdot B}{\|D_3\| \times \|B\|} = \frac{\{400, 11, 216, 18\} \cdot \{500, 100, 172.8, 20\}}{\|\{400, 11, 216, 18\}\| \times \|\{500, 100, 172.8, 20\}\|} = 0.97$$
$$\mathrm{sim}(D_3, D_1) = \cos\theta = \frac{D_3 \cdot D_1}{\|D_3\| \times \|D_1\|} = \frac{\{400, 11, 216, 18\} \cdot \{80, 15, 1080, 8\}}{\|\{400, 11, 216, 18\}\| \times \|\{80, 15, 1080, 8\}\|} = 0.48$$
Since 0.48 < 0.5, D3 and D1 can be placed in two different classes. However, since 0.97 > 0.5, D3 and B belong to the same class, so the condition that the similarity between D3 and every vector in set C2 is less than the preset threshold is not met, and D3 is not added to set C2. Set C2 is unchanged, and matrix A3 becomes A4 = {111500, 1, 0.77, 111500}.
Fifth, the last vector D4 = {111500, 1, 0.77, 111500} is taken and compared with every vector in set C2. According to formula (1):
$$\mathrm{sim}(D_4, B) = \cos\theta = \frac{D_4 \cdot B}{\|D_4\| \times \|B\|} = \frac{\{111500, 1, 0.77, 111500\} \cdot \{500, 100, 172.8, 20\}}{\|\{111500, 1, 0.77, 111500\}\| \times \|\{500, 100, 172.8, 20\}\|} = 0.014$$
$$\mathrm{sim}(D_4, D_1) = \cos\theta = \frac{D_4 \cdot D_1}{\|D_4\| \times \|D_1\|} = \frac{\{111500, 1, 0.77, 111500\} \cdot \{80, 15, 1080, 8\}}{\|\{111500, 1, 0.77, 111500\}\| \times \|\{80, 15, 1080, 8\}\|} = 0.057$$
Since 0.014 < 0.5, D4 and B can be placed in two different classes; since 0.057 < 0.5, D4 and D1 can also be placed in two different classes. D4 therefore meets the condition that its similarity to every vector in set C2 is less than the preset threshold, and D4 is added to set C2. Set C2 becomes
$$C_3 = \begin{pmatrix} 500 & 100 & 172.8 & 20 \\ 80 & 15 & 1080 & 8 \\ 111500 & 1 & 0.77 & 111500 \end{pmatrix}$$
Sixth, the number of vectors in C3, which is 3, is taken as the final K value.
Finally, K-means clustering is performed on the source IP addresses contained in each piece of text information according to the K value to obtain the clustering result, as follows:
Substituting K=3 into the K-means algorithm divides the vectors in vector matrix A into three classes:
Class 1: {111500, 1, 0.77, 111500};
Class 2: $\begin{pmatrix} 500 & 100 & 172.8 & 20 \\ 400 & 11 & 216 & 18 \end{pmatrix}$;
Class 3: $\begin{pmatrix} 40 & 10 & 2160 & 6 \\ 80 & 15 & 1080 & 8 \end{pmatrix}$.
It should be noted that the above example uses the source IP address as the first information purely for illustration. In a real scenario the number of pieces of text information in the log text may be far larger than in the example, and the first information may be other information, such as the domain name. These cases are not described one by one here; they can be carried out with reference to the above embodiment.
It should be noted that, in the embodiment of the invention, after the clustering result is obtained, the DNS log can be analysed further on the basis of that result. For example, the access frequency of a source IP address or a domain name can be used to judge the type of that IP address or domain name and whether an attack is taking place. Service traffic can also be analysed in combination with an AAA platform and NetFlow to gain insight into user behaviour and, from that, to derive user profiles. The invention does not describe this in detail; existing implementations may be consulted.
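One simple form such follow-up analysis could take is summarising each cluster by its average query volume and flagging unusually busy clusters for inspection. The sketch below uses the five vectors and the three classes of the worked example; the function names, the assumption that the first element of each vector is the total query count, and the limit of 10000 are illustrative choices that do not come from the source.

```python
def mean_total_queries(vectors, labels):
    """Average of the first feature (assumed to be total queries) per cluster label."""
    sums, counts = {}, {}
    for vector, label in zip(vectors, labels):
        sums[label] = sums.get(label, 0.0) + vector[0]
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def flag_high_volume(vectors, labels, limit):
    """Return the sorted labels of clusters whose mean total queries exceed limit."""
    means = mean_total_queries(vectors, labels)
    return sorted(label for label, mean in means.items() if mean > limit)


# Feature vectors of the worked example, in the row order of matrix A.
VECTORS = [
    [500, 100, 172.8, 20],
    [40, 10, 2160, 6],
    [400, 11, 216, 18],
    [80, 15, 1080, 8],
    [111500, 1, 0.77, 111500],
]
# Cluster labels matching the three classes listed in the example
# (0 = Class 1, 1 = Class 2, 2 = Class 3).
LABELS = [1, 2, 1, 2, 0]

# The limit is a hypothetical value chosen only for this illustration.
suspicious = flag_high_volume(VECTORS, LABELS, limit=10000)
```

A flagged cluster is only a starting point for the further judgement described above, not a verdict that an attack is in progress.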
The embodiment of the invention provides a DNS log analysis method comprising: obtaining a DNS log and preprocessing it to obtain preprocessed log text, where the preprocessed log text contains at least one piece of text information and each piece of text information contains first information corresponding to that text information; extracting, from the preprocessed log text, the first information contained in each piece of text information and constructing a feature vector matrix of the first information; determining, according to the feature vector matrix of the first information, the K value to use when K-means clustering is performed on the first information contained in each piece of text information; and performing K-means clustering on that first information according to the K value to obtain a clustering result. With this method, once the first information has been extracted, its feature vector matrix can be constructed and the K value for K-means clustering can be determined from that matrix. The computation is simple and effective, which improves the clustering result.
Embodiment 2
An embodiment of the invention provides a DNS log analysis apparatus 50 for performing the steps carried out by the DNS log analysis apparatus 50 in the DNS log analysis method shown in Fig. 1, Fig. 2 or Fig. 4. The DNS log analysis apparatus 50 may comprise units corresponding to the respective steps. For example, as shown in Fig. 5, it may comprise an acquiring unit 501, a construction unit 502, a determining unit 503 and a clustering unit 504.
The acquiring unit 501 is configured to obtain a DNS log and preprocess the DNS log to obtain preprocessed log text, where the preprocessed log text contains at least one piece of text information and each piece of text information contains first information corresponding to that text information.
The construction unit 502 is configured to extract, from the preprocessed log text, the first information contained in each piece of text information, and to construct a feature vector matrix of the first information.
The determining unit 503 is configured to determine, according to the feature vector matrix of the first information, the K value to use when K-means clustering is performed on the first information contained in each piece of text information.
The clustering unit 504 is configured to perform, according to the K value, K-means clustering on the first information contained in each piece of text information to obtain a clustering result.
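The division into four units can be pictured as a small pipeline in which each unit is a replaceable function. The sketch below is one possible way to wire such units together and is not taken from the source: the class name `DnsLogAnalyzer`, the field names and the type signatures are assumptions chosen to mirror units 501 to 504.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class DnsLogAnalyzer:
    """Hypothetical composition of the four units of apparatus 50."""

    acquire: Callable[[str], List[str]]                 # acquiring unit 501
    build: Callable[[List[str]], List[Vector]]          # construction unit 502
    determine_k: Callable[[List[Vector]], int]          # determining unit 503
    cluster: Callable[[List[Vector], int], List[int]]   # clustering unit 504

    def run(self, raw_log: str) -> List[int]:
        """Run S101-S104 in order and return one cluster label per vector."""
        texts = self.acquire(raw_log)        # S101: preprocess the DNS log
        matrix = self.build(texts)           # S102: feature vector matrix
        k = self.determine_k(matrix)         # S103: choose K
        return self.cluster(matrix, k)       # S104: K-means clustering
```

Keeping each unit behind a plain callable makes it easy to swap in a different preprocessing step or K-selection rule without touching the rest of the pipeline.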
Optionally, the determining unit 503 is specifically configured to:
Randomly select a vector B from the feature vector matrix A of the first information and add the vector B to an empty set C, obtaining a set C1 and a vector matrix A1, where the set C1 contains the vector B and the vector matrix A1 is the feature vector matrix A with the vector B removed.
With m=1 and n=1 as initial values, repeat steps S1-S3 until the vector matrix A(m+1) is empty, where m and n are positive integers not less than 1:
S1: randomly select a vector Dm from the vector matrix Am, and determine the similarity between the vector Dm and each vector in the set Cn;
S2: if the similarity between the vector Dm and every vector in the set Cn is less than a preset threshold, add the vector Dm to the set Cn, obtaining a set C(n+1) and a vector matrix A(m+1), where the set C(n+1) contains the vector Dm and the vector matrix A(m+1) is the vector matrix Am with the vector Dm removed;
S3: if the similarity between the vector Dm and any vector in the set Cn is not less than the preset threshold, obtain the vector matrix A(m+1).
The number of vectors in the set C(n+1) when the vector matrix A(m+1) is empty is taken as the K value to be used when K-means clustering is performed on the first information contained in each piece of text information.
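The loop S1-S3 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `estimate_k`, the `similarity` callable, and the `rng` parameter are hypothetical, and returning 0 for an empty matrix is an assumption, since the text does not cover that case.

```python
import random


def estimate_k(vectors, similarity, threshold, rng=None):
    """Count the mutually dissimilar vectors found by the S1-S3 loop."""
    if rng is None:
        rng = random.Random()
    remaining = list(vectors)              # plays the role of A, A1, A2, ...
    if not remaining:
        return 0                           # assumption: an empty matrix gives K = 0
    rng.shuffle(remaining)                 # popping a shuffled list = random selection
    representatives = [remaining.pop()]    # vector B seeds the set C1
    while remaining:                       # stop once A(m+1) is empty
        d = remaining.pop()                # S1: pick Dm and remove it from Am
        # S2: keep Dm only if it is dissimilar to every vector already in Cn
        if all(similarity(d, c) < threshold for c in representatives):
            representatives.append(d)
        # S3: otherwise Dm is dropped and the set is left unchanged
    return len(representatives)
```

Every vector is removed from the matrix exactly once, so the loop ends after as many iterations as there are vectors, and the size of the retained set is the K value.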
Further, the determining unit 503 is specifically configured to:
Randomly select a vector Dm from the vector matrix Am, and determine the similarity between the vector Dm and each vector in the set Cn according to a first preset formula, where the first preset formula comprises:
$$\mathrm{sim}(X, Y) = \cos\theta = \frac{X \cdot Y}{\|X\| \times \|Y\|},$$
where sim(X, Y) denotes the similarity between X and Y, X denotes the vector Dm, Y denotes a vector in the set Cn, and ||*|| denotes the modulus of *.
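As an illustration, the first preset formula can be written as a small Python function. The name `cosine_similarity` is hypothetical, and raising an error for a zero-length vector is an assumption, because the formula is undefined in that case and the text does not say how it is handled.

```python
import math


def cosine_similarity(x, y):
    """Return cos(theta) = (X . Y) / (||X|| * ||Y||) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))   # modulus ||X||
    norm_y = math.sqrt(sum(b * b for b in y))   # modulus ||Y||
    if norm_x == 0.0 or norm_y == 0.0:
        # assumption: the formula is undefined for a zero vector, so reject it
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)
```

Orthogonal vectors give a similarity of 0 and parallel vectors give a similarity of 1, so a larger value means the two vectors point in more similar directions.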
Optionally, the clustering unit 504 is specifically configured to:
Select K vectors from the feature vector matrix A of the first information as initial center points.
Calculate the distance from each vector in the feature vector matrix A to each of the K vectors.
Divide all vectors in the vector matrix A according to the minimum-distance principle, obtaining K classes.
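The three steps above can be sketched as a single assignment pass. The function names are hypothetical, and choosing the initial centers with `random.Random.sample` is an assumption, since the text does not say how the K vectors are chosen. `math.dist` requires Python 3.8 or later. A conventional K-means implementation would normally recompute the centers and repeat the pass, which this passage does not describe.

```python
import math
import random


def assign_to_nearest(vectors, centers):
    """Put each vector into the class of its nearest center (minimum distance)."""
    clusters = [[] for _ in centers]
    for v in vectors:
        distances = [math.dist(v, c) for c in centers]   # distance to each center
        nearest = distances.index(min(distances))        # first center wins a tie
        clusters[nearest].append(v)
    return clusters


def cluster_once(vectors, k, rng=None):
    """Pick K initial centers from the vectors and run one assignment pass."""
    if rng is None:
        rng = random.Random()
    centers = rng.sample(list(vectors), k)   # assumption: centers chosen at random
    return centers, assign_to_nearest(vectors, centers)
```

For example, with centers (0, 0) and (10, 10), the points (0, 1) and (10, 11) fall into the first and second class respectively.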
Further, the clustering unit 504 is specifically configured to:
Calculate the distance from each vector in the feature vector matrix A to each of the K vectors according to a second preset formula, where the second preset formula comprises:
$$\mathrm{dist}(X, Y) = \left( \sum_{i=1}^{N} (x_i - y_i)^2 \right)^{1/2},$$
where dist(X, Y) denotes the distance between X and Y, X denotes a vector in the vector matrix A, Y denotes one of the K vectors, $\sum_{i=1}^{N}$ denotes summation over i from 1 to N, $x_i$ denotes the i-th element of the vector X, and $y_i$ denotes the i-th element of the vector Y.
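For illustration, the second preset formula corresponds to the following Python function. The name `euclidean_distance` is hypothetical, and the length check is an added safeguard that the text does not mention.

```python
import math


def euclidean_distance(x, y):
    """Return (sum over i of (x_i - y_i)^2) ** (1/2) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

As a worked case, X = (0, 0) and Y = (3, 4) give (9 + 16)^(1/2) = 5.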
It can be understood that the DNS log analysis device 50 of this embodiment of the present invention may correspond to the DNS log analysis device 50 in the DNS log analysis method shown in Fig. 1, Fig. 2 or Fig. 4 above, and that the division and/or functions of the units in the DNS log analysis device 50 serve to implement the DNS log analysis method flow shown in Fig. 1, Fig. 2 or Fig. 4. For brevity, details are not repeated here.
Because the DNS log analysis device 50 in this embodiment of the present invention can be used to perform the above method flow, the technical effects it can achieve may also be found in the above method embodiments and are not repeated here.
With the DNS log analysis device provided by this embodiment of the present invention, after the first information is extracted, a feature vector matrix of the first information can be built, and the K value for K-means clustering of the first information can then be determined from that feature vector matrix. The computation is therefore simple and effective, which improves the clustering result.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the device described above is illustrated only by the above division into functional modules. In practice, the above functions may be assigned to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. For the specific working processes of the system, device and units described above, refer to the corresponding processes in the foregoing method embodiments; they are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into modules or units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical or in another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) or a processor to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above are merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A domain name system (DNS) log analysis method, characterized in that the method comprises:
Obtaining a DNS log and preprocessing the DNS log to obtain a preprocessed log text, where the preprocessed log text contains at least one piece of text information and each piece of text information contains first information corresponding to that piece of text information;
Extracting, from the preprocessed log text, the first information contained in each piece of text information, and building a feature vector matrix of the first information;
Determining, according to the feature vector matrix of the first information, the K value to be used when K-means clustering is performed on the first information contained in each piece of text information;
Performing K-means clustering, according to the K value, on the first information contained in each piece of text information, to obtain a clustering result.
2. The method according to claim 1, characterized in that the determining, according to the feature vector matrix of the first information, the K value to be used when K-means clustering is performed on the first information contained in each piece of text information comprises:
Randomly selecting a vector B from the feature vector matrix A of the first information and adding the vector B to an empty set C, obtaining a set C1 and a vector matrix A1, where the set C1 contains the vector B and the vector matrix A1 is the feature vector matrix A with the vector B removed;
With m=1 and n=1 as initial values, repeating steps S1-S3 until the vector matrix A(m+1) is empty, where m and n are positive integers not less than 1:
S1: randomly selecting a vector Dm from the vector matrix Am, and determining the similarity between the vector Dm and each vector in the set Cn;
S2: if the similarity between the vector Dm and every vector in the set Cn is less than a preset threshold, adding the vector Dm to the set Cn, obtaining a set C(n+1) and a vector matrix A(m+1), where the set C(n+1) contains the vector Dm and the vector matrix A(m+1) is the vector matrix Am with the vector Dm removed;
S3: if the similarity between the vector Dm and any vector in the set Cn is not less than the preset threshold, obtaining the vector matrix A(m+1);
Taking the number of vectors in the set C(n+1) when the vector matrix A(m+1) is empty as the K value to be used when K-means clustering is performed on the first information contained in each piece of text information.
3. The method according to claim 2, characterized in that the randomly selecting a vector Dm from the vector matrix Am, and determining the similarity between the vector Dm and each vector in the set Cn comprises:
Randomly selecting a vector Dm from the vector matrix Am, and determining the similarity between the vector Dm and each vector in the set Cn according to a first preset formula, where the first preset formula comprises:
$$\mathrm{sim}(X, Y) = \cos\theta = \frac{X \cdot Y}{\|X\| \times \|Y\|},$$
where sim(X, Y) denotes the similarity between X and Y, X denotes the vector Dm, Y denotes a vector in the set Cn, and ||*|| denotes the modulus of *.
4. The method according to any one of claims 1-3, characterized in that the performing K-means clustering, according to the K value, on the first information contained in each piece of text information, to obtain a clustering result comprises:
Selecting K vectors from the feature vector matrix A of the first information as initial center points;
Calculating the distance from each vector in the feature vector matrix A to each of the K vectors;
Dividing all vectors in the vector matrix A according to the minimum-distance principle, obtaining K classes.
5. The method according to claim 4, characterized in that the calculating the distance from each vector in the feature vector matrix A to each of the K vectors comprises:
Calculating the distance from each vector in the feature vector matrix A to each of the K vectors according to a second preset formula, where the second preset formula comprises:
$$\mathrm{dist}(X, Y) = \left( \sum_{i=1}^{N} (x_i - y_i)^2 \right)^{1/2},$$
where dist(X, Y) denotes the distance between X and Y, X denotes a vector in the vector matrix A, Y denotes one of the K vectors, $\sum_{i=1}^{N}$ denotes summation over i from 1 to N, $x_i$ denotes the i-th element of the vector X, and $y_i$ denotes the i-th element of the vector Y.
6. A domain name system (DNS) log analysis device, characterized in that the DNS log analysis device comprises an acquiring unit, a construction unit, a determining unit and a clustering unit;
The acquiring unit is configured to obtain a DNS log and preprocess the DNS log to obtain a preprocessed log text, where the preprocessed log text contains at least one piece of text information and each piece of text information contains first information corresponding to that piece of text information;
The construction unit is configured to extract, from the preprocessed log text, the first information contained in each piece of text information, and to build a feature vector matrix of the first information;
The determining unit is configured to determine, according to the feature vector matrix of the first information, the K value to be used when K-means clustering is performed on the first information contained in each piece of text information;
The clustering unit is configured to perform K-means clustering, according to the K value, on the first information contained in each piece of text information, to obtain a clustering result.
7. The DNS log analysis device according to claim 6, characterized in that the determining unit is specifically configured to:
Randomly select a vector B from the feature vector matrix A of the first information and add the vector B to an empty set C, obtaining a set C1 and a vector matrix A1, where the set C1 contains the vector B and the vector matrix A1 is the feature vector matrix A with the vector B removed;
With m=1 and n=1 as initial values, repeat steps S1-S3 until the vector matrix A(m+1) is empty, where m and n are positive integers not less than 1:
S1: randomly select a vector Dm from the vector matrix Am, and determine the similarity between the vector Dm and each vector in the set Cn;
S2: if the similarity between the vector Dm and every vector in the set Cn is less than a preset threshold, add the vector Dm to the set Cn, obtaining a set C(n+1) and a vector matrix A(m+1), where the set C(n+1) contains the vector Dm and the vector matrix A(m+1) is the vector matrix Am with the vector Dm removed;
S3: if the similarity between the vector Dm and any vector in the set Cn is not less than the preset threshold, obtain the vector matrix A(m+1);
Take the number of vectors in the set C(n+1) when the vector matrix A(m+1) is empty as the K value to be used when K-means clustering is performed on the first information contained in each piece of text information.
8. The DNS log analysis device according to claim 7, characterized in that the determining unit is specifically configured to:
Randomly select a vector Dm from the vector matrix Am, and determine the similarity between the vector Dm and each vector in the set Cn according to a first preset formula, where the first preset formula comprises:
$$\mathrm{sim}(X, Y) = \cos\theta = \frac{X \cdot Y}{\|X\| \times \|Y\|},$$
where sim(X, Y) denotes the similarity between X and Y, X denotes the vector Dm, Y denotes a vector in the set Cn, and ||*|| denotes the modulus of *.
9. The DNS log analysis device according to any one of claims 6-8, characterized in that the clustering unit is specifically configured to:
Select K vectors from the feature vector matrix A of the first information as initial center points;
Calculate the distance from each vector in the feature vector matrix A to each of the K vectors;
Divide all vectors in the vector matrix A according to the minimum-distance principle, obtaining K classes.
10. The DNS log analysis device according to claim 9, characterized in that the clustering unit is specifically configured to:
Calculate the distance from each vector in the feature vector matrix A to each of the K vectors according to a second preset formula, where the second preset formula comprises:
$$\mathrm{dist}(X, Y) = \left( \sum_{i=1}^{N} (x_i - y_i)^2 \right)^{1/2},$$
where dist(X, Y) denotes the distance between X and Y, X denotes a vector in the vector matrix A, Y denotes one of the K vectors, $\sum_{i=1}^{N}$ denotes summation over i from 1 to N, $x_i$ denotes the i-th element of the vector X, and $y_i$ denotes the i-th element of the vector Y.
CN201510920374.XA 2015-12-11 2015-12-11 DNS log analysis method and device Active CN105574539B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510920374.XA CN105574539B (en) 2015-12-11 2015-12-11 DNS log analysis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510920374.XA CN105574539B (en) 2015-12-11 2015-12-11 DNS log analysis method and device

Publications (2)

Publication Number Publication Date
CN105574539A true CN105574539A (en) 2016-05-11
CN105574539B CN105574539B (en) 2018-09-21

Family

ID=55884645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510920374.XA Active CN105574539B (en) 2015-12-11 2015-12-11 DNS log analysis method and device

Country Status (1)

Country Link
CN (1) CN105574539B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112261153A (en) * 2020-03-04 2021-01-22 腾讯科技(深圳)有限公司 Network resource management method and related device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040241730A1 (en) * 2003-04-04 2004-12-02 Zohar Yakhini Visualizing expression data on chromosomal graphic schemes
CN101079072A (en) * 2007-06-22 2007-11-28 中国科学院研究生院 Text clustering element study method and device
US20130031059A1 (en) * 2011-07-25 2013-01-31 Yahoo! Inc. Method and system for fast similarity computation in high dimensional space
CN103647676A (en) * 2013-12-30 2014-03-19 中国科学院计算机网络信息中心 Method for processing data of domain system
CN104166982A (en) * 2014-06-30 2014-11-26 复旦大学 Image optimization clustering method based on typical correlation analysis
CN104182506A (en) * 2014-08-19 2014-12-03 浪潮(北京)电子信息产业有限公司 Log management method
CN104636449A (en) * 2015-01-27 2015-05-20 厦门大学 Distributed type big data system risk recognition method based on LSA-GCC

Also Published As

Publication number Publication date
CN105574539B (en) 2018-09-21

Similar Documents

Publication Publication Date Title
CN105404699A (en) Method, device and server for searching articles of finance and economics
CN102521248A (en) Network user classification method and device
CN104376406A (en) Enterprise innovation resource management and analysis system and method based on big data
CN111090807B (en) Knowledge graph-based user identification method and device
Sharma et al. K-modes clustering algorithm for categorical data
CN104462396B (en) Character string processing method and device
CN103838754A (en) Information searching device and method
CN104796300B (en) A kind of packet feature extracting method and device
CN104361092A (en) Searching method and device
CN110753065B (en) Network behavior detection method, device, equipment and storage medium
CN105302807A (en) Method and apparatus for obtaining information category
CN104731828A (en) Interdisciplinary document similarity calculation method and interdisciplinary document similarity calculation device
CN104484392A (en) Method and device for generating database query statement
CN109241392A (en) Recognition methods, device, system and the storage medium of target word
CN103077254A (en) Webpage acquiring method and device
CN104750791A (en) Image retrieval method and device
CN105045790A (en) Graph data search system, method and device
CN104980462A (en) Distributed computation method, distributed computation device and distributed computation system
CN103823892A (en) Method and device of determining webpage clustering mode
CN111061837A (en) Topic identification method, device, equipment and medium
CN104462347B (en) The sorting technique and device of keyword
CN111512304B (en) Method and system for aspect clustering on two-dimensional aspect cubes
CN103885977A (en) Webpage data classification method, device and system
CN105574539A (en) DNS log analysis method and apparatus
CN104933178A (en) Official website determining method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant