CN105516027B - Method for establishing an application identification model, and method and device for identifying traffic data - Google Patents
- Publication number
- CN105516027B (application number CN201610018242.2A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L47/00—Traffic control in data switching networks
- H04L47/70—Admission control; Resource allocation
- H04L47/80—Actions related to the user profile or the type of traffic
- H04L47/803—Application aware
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Computer And Data Communications (AREA)
Abstract
The present invention provides a method for establishing an application identification model, and a method and device for identifying traffic data. The method for establishing the application identification model is applied to an environment in which a host and a network-side node exchange data, where at least one host process capable of processing data is provided on the host. The method comprises: obtaining multiple pieces of host data sent by the host; obtaining multiple pieces of traffic data received by the network-side node; comparing each piece of host data with each piece of traffic data to find at least one pair of host data and traffic data that are associated; processing the parameters of each associated pair to obtain, for each pair, the correspondence between the host process name and the packet payload; and establishing the application identification model from the correspondence between each host process name and packet payload. The invention can improve the accuracy, convenience and efficiency of application identification.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for establishing an application identification model based on deep learning, and to a method and device for identifying traffic data.
Background
In an intranet environment, different applications often have different priorities. Take management applications and P2P applications as an example: most enterprises classify the former as important applications and the latter as restricted applications. Even the same application may have a different priority in different enterprises. For example, a video application has a different status in a video company than in an e-commerce company. Understanding how each application is used also helps to optimize and configure the applications and network within an enterprise, so that information is transmitted quickly and work proceeds efficiently. Identifying applications within a local area network is therefore very important.
One existing method is port matching: the source port or destination port used by a data flow is compared against the system's existing application-port database to determine the application to which the data flow corresponds. The data flows generated by some applications are transmitted over specific ports, so port matching is feasible for those applications. Its advantage is that it needs neither the storage and analysis of large amounts of data nor complex algorithms, so the system burden is very small. In practice, however, some ports correspond to several applications, and some applications do not use fixed ports; both possibilities lead to low identification accuracy.
Another method is pattern matching, which is currently the most widely used. Pattern matching falls into two categories. The first works on the host side and matches the file features of an application, such as product version, product name, company, file version and source file name, against a known feature library. This requires identification software to be installed on every host, which affects user experience and host performance. The second works on the network side and identifies applications by matching data flows against known feature rules. This requires features to be analyzed and defined manually; because new applications appear every day, the manual analysis workload is too large to keep up with the rate at which applications are added.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a deep-learning-based method and device for establishing an application identification model, and a method and device for identifying traffic data, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, an embodiment of the invention provides a method for establishing an application identification model based on deep learning, applied to an environment in which a host and a network-side node exchange data, where at least one host process capable of processing data is provided on the host. The method comprises:
Obtaining multiple pieces of host data sent by the host, where each piece of host data carries the name of the host process that handled that host data on the host;
Obtaining multiple pieces of traffic data received by the network-side node, where each piece of traffic data carries the packet payload captured when the network-side node received that traffic data;
Comparing each piece of host data with each piece of traffic data to find at least one pair of host data and traffic data that are associated;
Processing the parameters of each associated pair of host data and traffic data to obtain, for each pair, the correspondence between the host process name and the packet payload;
Establishing the application identification model from the correspondence between each host process name and packet payload.
Optionally, comparing each piece of host data with each piece of traffic data to find at least one pair of host data and traffic data that are associated comprises:
Comparing the parameters carried by each piece of host data with the parameters carried by each piece of traffic data;
Finding the associated pairs of host data and traffic data according to a comparison rule under which either multiple parameters are identical or the proportion of identical parameters exceeds a proportion threshold.
Optionally, the parameters carried by host data include at least: the sending time of the host data, source IP address, source port number, destination IP address, destination port number, and the name of the process that handled the host data;
The parameters carried by traffic data include at least: the receiving time of the traffic data, source IP address, source port number, destination IP address, destination port number, and the packet payload of the traffic data.
Optionally, after the associated pairs of host data and traffic data have been found according to the comparison rule under which either multiple parameters are identical or the proportion of identical parameters exceeds a proportion threshold, the method further comprises:
Screening the pairs determined to be associated according to a screening rule, to pick out the pairs of host data and traffic data that have a false association;
Deleting the pairs of host data and traffic data that have a false association.
Optionally, screening the pairs determined to be associated according to the screening rule, to pick out the pairs that have a false association, includes at least one of the following:
If one piece of host data is associated with two or more pieces of traffic data, the association is determined to be a false association;
If one piece of host data is associated with one piece of traffic data but their time difference exceeds a time difference threshold, the association is determined to be a false association.
Optionally, establishing the application identification model from the correspondence between each host process name and packet payload comprises:
Performing machine language conversion on the host process name and on the packet payload separately, converting each into machine-recognizable data;
Establishing the correspondence between the converted host process name and the converted packet payload, and establishing the application identification model from that correspondence.
Optionally, performing machine language conversion on the host process name, converting it into machine-recognizable data, comprises:
Mapping the host process names onto an ordered list that starts at 0 and increments by one, so that each host process name is converted into a corresponding natural number.
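The mapping from process names to natural numbers can be sketched as follows. This is a minimal illustration only: the function name `build_label_map` and the sample process names are hypothetical, and numbering names in order of first appearance is an assumption, since the text only requires an ordered list that starts at 0 and increments by one.

```python
def build_label_map(process_names):
    """Assign each distinct process name the next natural number, starting at 0.

    Order of first appearance is an assumption; any fixed ordering would do.
    """
    label_map = {}
    for name in process_names:
        if name not in label_map:
            label_map[name] = len(label_map)
    return label_map


# Hypothetical process names observed in the associated pairs.
observed = ["browser.exe", "mail.exe", "browser.exe", "p2p.exe"]
label_map = build_label_map(observed)
labels = [label_map[name] for name in observed]
print(label_map)
print(labels)
```

Keeping the dictionary around after training matters, because the same numbering is needed later to turn a predicted class index back into a process name.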
Optionally, performing machine language conversion on the packet payload, converting it into machine-recognizable data, comprises:
Converting the packet payload, a hexadecimal string, into the corresponding decimal numbers;
Dividing each converted decimal number by 255 to obtain L floating-point numbers in the range [0,1], where L is the length of the packet payload.
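A minimal sketch of this payload conversion follows. It assumes that the payload is a plain hexadecimal string, that each pair of hexadecimal digits is one byte, and that L counts bytes; the function name `payload_to_floats` is hypothetical.

```python
def payload_to_floats(hex_payload):
    """Convert a hex-string payload into L floats in [0, 1], one per byte."""
    raw = bytes.fromhex(hex_payload)      # each byte becomes an integer 0..255
    return [byte / 255 for byte in raw]   # scale into the range [0, 1]


# Hypothetical 3-byte payload.
vector = payload_to_floats("00ff80")
print(vector)
```

The source does not say how payloads of different lengths are handled; padding or truncating to a fixed L before feeding a network would be a separate design decision.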
Optionally, the application identification model is used as follows:
The input data of the application identification model is obtained and processed by convolutional layers and pooling layers to generate the depth features of the input data;
The depth features are sent to a fully connected layer, the same as that of an ordinary neural network, which parses the depth features;
The fully connected layer passes the parsing result of the depth features to an output layer, which outputs it.
Optionally, the convolutional layers and pooling layers are stacked in multiple layers; the more layers are stacked, the deeper the depth features.
Optionally, the convolutional layer and the pooling layer are used in pairs.
Optionally, the window dimension of the convolutional layer and of the pooling layer is 1*n.
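The layer structure described above (paired 1*n convolution and pooling, then a fully connected layer, then an output layer) can be sketched as a forward pass in plain Python. This is an illustration under several assumptions that the text does not state: ReLU activation, max pooling, a softmax output, and random untrained weights. All function names and sizes are hypothetical.

```python
import math
import random


def conv1d(x, kernel):
    """Slide a 1*n window over x (valid positions only), then apply ReLU."""
    n = len(kernel)
    return [max(0.0, sum(x[i + j] * kernel[j] for j in range(n)))
            for i in range(len(x) - n + 1)]


def max_pool1d(x, n):
    """Non-overlapping 1*n max pooling."""
    return [max(x[i:i + n]) for i in range(0, len(x) - n + 1, n)]


def dense(x, weights, biases):
    """Fully connected layer: one weighted sum per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def forward(x, kernels, pool_n, weights, biases):
    for kernel in kernels:                  # each conv layer is paired with a pool
        x = max_pool1d(conv1d(x, kernel), pool_n)
    return softmax(dense(x, weights, biases))


random.seed(0)
payload = [random.random() for _ in range(16)]           # L = 16 floats in [0, 1]
kernels = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
# Lengths: 16 -> conv 14 -> pool 7 -> conv 5 -> pool 2 features.
weights = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
biases = [0.0, 0.0, 0.0]
probs = forward(payload, kernels, 2, weights, biases)    # one probability per process
print(probs)
```

Because the weights are random, the printed probabilities carry no meaning; a trained model would learn the kernels and dense weights from the payload and process-name pairs.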
According to another aspect of the present invention, an embodiment of the invention provides a method for identifying traffic data, comprising:
Receiving traffic data, where the traffic data carries the packet payload captured when the network-side node received the traffic data;
Converting the traffic data into data that the application identification model can recognize;
Inputting the recognizable data into the application identification model to obtain the probabilities that the identified data belongs to different host processes;
Identifying the host process corresponding to the traffic data according to the obtained probabilities.
Optionally, converting the traffic data into data that the application identification model can recognize comprises:
Performing machine language conversion on the packet payload of the traffic data, converting it into data that the application identification model can recognize.
Optionally, performing machine language conversion on the packet payload of the traffic data, converting it into data that the application identification model can recognize, comprises:
Converting the packet payload, a hexadecimal string, into the corresponding decimal numbers;
Dividing each converted decimal number by 255 to obtain L floating-point numbers in the range [0,1], where L is the length of the packet payload.
Optionally, identifying the host process corresponding to the traffic data according to the obtained probabilities comprises:
Selecting the maximum probability value as the decision result for the traffic data, and thereby determining the name of the host process corresponding to the traffic data.
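Selecting the maximum probability can be sketched as below. The helper `identify_process` and the sample values are hypothetical; the sketch assumes that position i in the probability list corresponds to the process name that was assigned natural number i during training.

```python
def identify_process(probabilities, process_names):
    """Return the process name with the highest probability, and that probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return process_names[best], probabilities[best]


# Hypothetical model output for one piece of traffic data.
names_by_label = ["browser.exe", "mail.exe", "p2p.exe"]
name, confidence = identify_process([0.1, 0.7, 0.2], names_by_label)
print(name, confidence)
```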
According to a further aspect of the present invention, an embodiment of the invention provides a device for establishing an application identification model based on deep learning, applied to an environment in which a host and a network-side node exchange data, where at least one host process capable of processing data is provided on the host. The device comprises:
A first obtaining module, configured to obtain multiple pieces of host data sent by the host, where each piece of host data carries the name of the host process that handled that host data on the host;
A second obtaining module, configured to obtain multiple pieces of traffic data received by the network-side node, where each piece of traffic data carries the packet payload captured when the network-side node received that traffic data;
A comparison module, configured to compare each piece of host data with each piece of traffic data to find at least one pair of host data and traffic data that are associated;
A third obtaining module, configured to process the parameters of each associated pair of host data and traffic data to obtain, for each pair, the correspondence between the host process name and the packet payload;
An establishing module, configured to establish the application identification model from the correspondence between each host process name and packet payload.
Optionally, the comparison module is further configured to:
Compare the parameters carried by each piece of host data with the parameters carried by each piece of traffic data;
Find the associated pairs of host data and traffic data according to a comparison rule under which either multiple parameters are identical or the proportion of identical parameters exceeds a proportion threshold.
Optionally, the parameters carried by host data include at least: the sending time of the host data, source IP address, source port number, destination IP address, destination port number, and the name of the process that handled the host data;
The parameters carried by traffic data include at least: the receiving time of the traffic data, source IP address, source port number, destination IP address, destination port number, and the packet payload of the traffic data.
Optionally, the comparison module is further configured to:
After the associated pairs of host data and traffic data have been found according to the comparison rule, screen the pairs determined to be associated according to a screening rule, to pick out the pairs that have a false association;
Delete the pairs of host data and traffic data that have a false association.
Optionally, the comparison module is further configured to:
If one piece of host data is associated with two or more pieces of traffic data, determine that the association is a false association;
Or
If one piece of host data is associated with one piece of traffic data but their time difference exceeds a time difference threshold, determine that the association is a false association.
Optionally, the establishing module is further configured to:
Perform machine language conversion on the host process name and on the packet payload separately, converting each into machine-recognizable data;
Establish the correspondence between the converted host process name and the converted packet payload, and establish the application identification model from that correspondence.
Optionally, the establishing module is further configured to:
Map the host process names onto an ordered list that starts at 0 and increments by one, so that each host process name is converted into a corresponding natural number.
Optionally, the establishing module is further configured to:
Convert the packet payload, a hexadecimal string, into the corresponding decimal numbers;
Divide each converted decimal number by 255 to obtain L floating-point numbers in the range [0,1], where L is the length of the packet payload.
According to a further aspect of the present invention, an embodiment of the invention provides a device for identifying traffic data, comprising:
A receiving module, configured to receive traffic data, where the traffic data carries the packet payload captured when the network-side node received the traffic data;
A conversion module, configured to convert the traffic data into data that the application identification model can recognize;
An input module, configured to input the recognizable data into the application identification model to obtain the probabilities that the identified data belongs to different host processes;
An identification module, configured to identify the host process corresponding to the traffic data according to the probabilities obtained by the input module.
Optionally, the conversion module is further configured to:
Perform machine language conversion on the packet payload of the traffic data, converting it into data that the application identification model can recognize.
Optionally, the conversion module is further configured to:
Convert the packet payload, a hexadecimal string, into the corresponding decimal numbers;
Divide each converted decimal number by 255 to obtain L floating-point numbers in the range [0,1], where L is the length of the packet payload.
Optionally, the identification module is further configured to:
Select the maximum probability value as the decision result for the traffic data, and thereby determine the name of the process corresponding to the traffic data.
In embodiments of the present invention, host data and traffic data are compared with each other to find the points at which they are associated, and the associated host data and traffic data are then selected on that basis. On the host side, host data is sent by a specific process; on the network side, the packet payload of each piece of traffic data can be determined when the traffic data is obtained. By analyzing the association between host data and traffic data, the embodiments determine the correspondence between host process names and the packet payloads that actually correspond to them, and generate the application identification model from this correspondence. When the model is later used to identify traffic data, the name of the process that sent the corresponding host data can be found from the packet payload of the traffic data, and the application that sent the host data can then be determined. Identifying the application makes it possible to determine its priority and hence the processing priority of the traffic data, which helps to optimize and configure the applications and network within an enterprise and keeps information flowing quickly and work proceeding efficiently. In other words, the embodiments can establish an application identification model that identifies the application that initiated a piece of traffic data: once the traffic data is input into the model, it can quickly be determined which application on the host sent it. No manual participation is needed and no identification software has to be added on the host side, which considerably improves the accuracy, convenience and efficiency of application identification.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
From the following detailed description of specific embodiments of the invention, read in conjunction with the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages and features of the invention.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a processing flowchart of the method for establishing an application identification model based on deep learning according to an embodiment of the invention;
Fig. 2 shows a schematic flowchart of the use of the application identification model according to an embodiment of the invention;
Fig. 3 shows a method for identifying traffic data according to an embodiment of the invention;
Fig. 4 shows a schematic structural diagram of the device for establishing an application identification model based on deep learning according to an embodiment of the invention;
Fig. 5 shows a schematic structural diagram of the device for identifying traffic data according to an embodiment of the invention; and
Fig. 6 shows a system diagram of the process of establishing the application identification model and the subsequent process of identifying traffic data according to an embodiment of the invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
To solve the above technical problems, embodiments of the present invention propose establishing a new application identification model that is generated on the basis of deep learning. Deep learning is a branch of machine learning; in essence it builds a deep neural network that analyzes and learns automatically, imitating the mechanisms of the human brain to learn from data, and in recent years it has achieved remarkable results in the image, speech and text fields. The application identification model established by the embodiments also has a self-learning ability: it can analyze data to determine deep learning features and learn the model parameters automatically, without relying on manual analysis or a file feature library. Moreover, neither establishing the model nor the identification process requires identification software to be installed on the host or data to be stored and processed on the host side, so the process is imperceptible to users and improves user experience.
Based on this inventive concept, an embodiment of the present invention proposes a method for establishing an application identification model based on deep learning. The method is applied to an environment in which a host and a network-side node exchange data. Data obtained from the host side is hereinafter referred to as host data, and data obtained from the network-side node is hereinafter referred to as traffic data. At least one host process capable of processing data is provided on the host. Fig. 1 shows a processing flowchart of the method according to an embodiment of the invention. Referring to Fig. 1, the method includes at least:
Step S102: obtain multiple pieces of host data sent by the host, where each piece of host data carries the name of the host process (process) that handled that host data on the host;
Step S104: obtain multiple pieces of traffic data received by the network-side node, where each piece of traffic data carries the packet payload (payload) captured when the network-side node received that traffic data;
Step S106: compare each piece of host data with each piece of traffic data to find at least one pair of host data and traffic data that are associated;
Step S108: process the parameters of each associated pair of host data and traffic data to obtain, for each pair, the correspondence between the host process name and the packet payload;
Step S110: establish the application identification model from the correspondence between each host process name and packet payload.
In embodiments of the present invention, host data and traffic data are compared with each other to find the points at which they are associated, and the associated host data and traffic data are then selected on that basis. On the host side, host data is sent by a specific process; on the network side, the packet payload of each piece of traffic data can be determined when the traffic data is obtained. By analyzing the association between host data and traffic data, the embodiments determine the correspondence between host process names and the packet payloads that actually correspond to them, and generate the application identification model from this correspondence. When the model is later used to identify traffic data, the name of the process that sent the corresponding host data can be found from the packet payload of the traffic data, and the application that sent the host data can then be determined. Identifying the application makes it possible to determine its priority and hence the processing priority of the traffic data, which helps to optimize and configure the applications and network within an enterprise and keeps information flowing quickly and work proceeding efficiently. In other words, the embodiments can establish an application identification model that identifies the application that initiated a piece of traffic data: once the traffic data is input into the model, it can quickly be determined which application on the host sent it. No manual participation is needed and no identification software has to be added on the host side, which considerably improves the accuracy, convenience and efficiency of application identification.
In a preferred embodiment, step S106 compares each piece of host data with each piece of traffic data to find at least one pair of host data and traffic data that are associated. Both host data and traffic data carry multiple parameters, and the parameters of a piece of data usually contain its most essential content or its identifying content, so the parameters carried by each piece of host data can be compared directly with the parameters carried by each piece of traffic data. If a piece of host data and a piece of traffic data satisfy the comparison rule that multiple parameters are identical (for example, three or more) or that the proportion of identical parameters exceeds a proportion threshold (for example, 60% or more), the two are determined to be associated. The comparison is repeated in turn to find all associated pairs of host data and traffic data.
In a specific embodiment, the parameters carried by host data include at least: the sending time of the host data, source IP address, source port number, destination IP address, destination port number, and the name of the process that handled the host data. The parameters carried by traffic data include at least: the receiving time of the traffic data, source IP address, source port number, destination IP address, destination port number, and the packet payload of the traffic data. In this case, if the source IP address, source port number and destination port number of a piece of host data are identical to the source IP address, source port number and destination port number of a piece of traffic data, the host data and the traffic data can be considered to be associated.
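The comparison rule can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the record classes, field names and sample values are hypothetical, and the sketch assumes that only the four address fields shared by both records are compared, with the example thresholds from the text (three identical parameters, or a proportion above 60%) as defaults.

```python
from dataclasses import dataclass

# Parameters carried by both record types; comparing only these is an assumption.
ADDRESS_FIELDS = ("src_ip", "src_port", "dst_ip", "dst_port")


@dataclass
class HostRecord:
    send_time: float
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    process: str       # name of the process that handled the host data


@dataclass
class TrafficRecord:
    recv_time: float
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    payload: str       # packet payload as a hexadecimal string


def is_associated(host, traffic, min_same=3, ratio_threshold=0.6):
    """Associated if enough parameters match, or the matching ratio exceeds the threshold."""
    same = sum(getattr(host, f) == getattr(traffic, f) for f in ADDRESS_FIELDS)
    return same >= min_same or same / len(ADDRESS_FIELDS) > ratio_threshold


def find_associated_pairs(hosts, traffics):
    """Compare every piece of host data with every piece of traffic data."""
    return [(h, t) for h in hosts for t in traffics if is_associated(h, t)]


hosts = [HostRecord(10.00, "10.0.0.5", 50000, "93.184.216.34", 443, "browser.exe")]
traffics = [
    TrafficRecord(10.02, "10.0.0.5", 50000, "93.184.216.34", 443, "160301"),
    TrafficRecord(10.03, "10.0.0.9", 41000, "93.184.216.34", 443, "474554"),
]
pairs = find_associated_pairs(hosts, traffics)
print([(h.process, t.payload) for h, t in pairs])
```

The all-pairs comparison is quadratic in the number of records; indexing traffic data by its address fields would be a natural optimization, but it is outside what the text describes.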
Further, the above comparison determines most of the associated data, but some special cases remain, which the embodiments of the invention refer to as pairs of host data and traffic data that have a false association. If such special cases exist, the pairs determined to be associated need to be screened according to a screening rule to pick out the pairs that have a false association, and those pairs are then deleted.
An association is determined to be a pseudo-association if it meets either of the following special cases defined by the embodiment of the present invention:
First, if one piece of host data is associated with two or more pieces of traffic data, the association is determined to be a pseudo-association.
After one piece of host data is sent, the network side node should also receive exactly one piece of data. If two or more associated pieces of traffic data appear, other data may have been mixed in during transmission, and the association is no longer unique. If such data were used to establish the application identification model, the same traffic data could be identified as two applications, which would harm the accuracy of the model's results.
Second, if one piece of host data is associated with one piece of traffic data but the time difference between the two exceeds a time difference threshold, the association is determined to be a pseudo-association.
Network data is transmitted at high speed, so transmission is usually fast and takes very little time. If the two times differ too much, the host data was probably lost, and what the network side node received is not the traffic data corresponding to this host data. To guarantee the accuracy of the data used to establish the application identification model, such pairs are not used.
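The two screening rules can be sketched as below. The tuple layout, the identifiers and the sample values are hypothetical, and times are assumed to be in seconds; the 600-second threshold is only an example value.

```python
from collections import defaultdict

def drop_pseudo_associations(pairs, max_time_diff):
    """Screen candidate pairs given as (host_id, host_time, flow_id, flow_time) tuples."""
    by_host = defaultdict(list)
    for pair in pairs:
        by_host[pair[0]].append(pair)
    kept = []
    for group in by_host.values():
        if len(group) >= 2:
            continue  # rule 1: one host record matched two or more traffic records
        _, host_time, _, flow_time = group[0]
        if abs(flow_time - host_time) > max_time_diff:
            continue  # rule 2: the time difference exceeds the threshold
        kept.append(group[0])
    return kept

# Made-up candidates: h1 is ambiguous, h2 is clean, h3 is 31 minutes apart.
candidates = [
    ("h1", 0.0, "f1", 0.5),
    ("h1", 0.0, "f2", 0.7),
    ("h2", 10.0, "f3", 10.4),
    ("h3", 20.0, "f4", 1880.0),
]
kept = drop_pseudo_associations(candidates, max_time_diff=600.0)
```

Only the h2 pair survives: h1 is dropped by the first rule and h3 by the second.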
The application identification model itself exists as a machine model, whereas host process names and packet payloads are not in machine language. In implementation, therefore, the host process names and packet payloads can each be converted beforehand into machine-recognizable data; the correspondence between the converted host process names and packet payloads is then established, and the application identification model is established from that correspondence.
Specifically, host process names can be mapped to an ordered table whose indices start at 0 and increase one by one, so that each host process name is converted into a corresponding natural number. In addition, the packet payload, given as a hexadecimal string, can be converted into the corresponding decimal numbers; dividing each converted decimal number by 255 yields L floating-point numbers in [0,1], where L is the length of the packet payload.
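The payload conversion can be sketched in a few lines. The sketch assumes that each pair of hexadecimal digits is one byte and that L counts bytes; the function name and the sample string are illustrative.

```python
def payload_to_floats(hex_payload):
    """Convert a hexadecimal payload string into L floats in [0, 1], one per byte."""
    raw = bytes.fromhex(hex_payload)   # each byte is a decimal number in 0-255
    return [value / 255 for value in raw]

# Made-up five-byte payload for illustration.
features = payload_to_floats("474554202f")
```

Here the five bytes 0x47, 0x45, 0x54, 0x20 and 0x2f become 71/255, 69/255, 84/255, 32/255 and 47/255.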
The embodiment of the present invention provides a specific example of converting host process names and packet payloads into machine language. Table one shows the list of correspondences between host process names and packet payloads obtained according to the embodiment of the present invention.
Table one
Payload | Process |
474554202f… | S.exe |
474554201e… | S.exe |
70a100347b… | A.exe |
803d010301… | B.exe |
160302009c… | C.exe |
803d010302… | B.exe |
504f535420… | D.exe |
First, the packet payload is converted. For the Payload: the hexadecimal string is converted into decimal numbers in the range 0-255, and each number is then divided by 255. Each sample yields L floating-point numbers in [0,1], where L is the length of the payload. The conversion results are given in Table two.
Table two
Next, the host process names are converted. Specifically, the names are mapped to an ordered table whose indices start at 0 and increase one by one, as shown in Table three.
Table three
Index | Process |
0 | S.exe |
1 | A.exe |
2 | B.exe |
3 | C.exe |
4 | D.exe |
…… | …… |
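The mapping from process names to indices can be sketched as follows. Assigning indices in order of first appearance is an assumption, chosen because it reproduces the order of Table three from the Process column of Table one; the function name is illustrative.

```python
def build_label_table(process_names):
    """Assign each distinct process name the next natural number, starting at 0."""
    table = {}
    for name in process_names:
        if name not in table:
            table[name] = len(table)
    return table

# Process column of Table one, in row order.
processes = ["S.exe", "S.exe", "A.exe", "B.exe", "C.exe", "B.exe", "D.exe"]
label_table = build_label_table(processes)
labels = [label_table[name] for name in processes]
```

The resulting labels are the numerical indices that stand in for the process names during training.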
For ease of retrieval, each process in Table three is further replaced by its numerical index, which produces Table four:
Table four
After the training data has been converted, the embodiment of the present invention generates the application identification model from the converted data and uses it. Fig. 2 shows a schematic flow chart of the use of the application identification model according to an embodiment of the present invention. The application identification model provided by the embodiment of the present invention is generated on the basis of a convolutional neural network (CNN), a deep learning model commonly used in image recognition, for example handwritten digit recognition, face recognition and picture classification. The biggest difference between the application identification model provided by the embodiment of the present invention and a conventional CNN model is that the convolution and pooling windows are 1*n instead of the n*n suited to two-dimensional images (n is the size of the basic convolution or pooling unit).
Specifically, the use of the application identification model includes:
First, the input data of the application identification model is obtained and processed by convolutional layers and pooling layers to generate deep features of the input data. The input data first passes through several convolutional and pooling layers. Convolutional and pooling layers are usually used in pairs, or, beyond a certain depth, only convolutional layers are used without pooling layers. The more layers are stacked, the deeper the resulting network. Convolutional and pooling layers must be applied at least twice before deep features are produced.
Next, the deep features are sent to fully connected layers, which are the same as those of a conventional neural network, and are parsed there. The number of fully connected layers should not be too large, generally 1 to 3.
Finally, the fully connected layers pass the parsing result of the deep features to the output layer, which outputs it.
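The forward pass through 1*n convolution and pooling windows, a fully connected layer and the output layer can be sketched in plain Python. This is a toy illustration under stated assumptions: a single channel, ReLU as the activation, softmax at the output, and random untrained weights. None of these choices, nor the layer sizes, are specified by the source.

```python
import math
import random

def conv1d(signal, kernel):
    """Valid 1*n convolution (cross-correlation form) followed by ReLU."""
    n = len(kernel)
    return [max(0.0, sum(signal[i + j] * kernel[j] for j in range(n)))
            for i in range(len(signal) - n + 1)]

def max_pool1d(signal, size):
    """Non-overlapping 1*size max pooling."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]

def dense_softmax(features, weights, biases):
    """One fully connected layer followed by softmax over the process classes."""
    scores = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
x = [random.random() for _ in range(32)]        # stand-in for a converted payload
k1 = [random.uniform(-1, 1) for _ in range(3)]  # untrained 1*3 kernels
k2 = [random.uniform(-1, 1) for _ in range(3)]

h = max_pool1d(conv1d(x, k1), 2)   # first convolution and pooling pair
h = max_pool1d(conv1d(h, k2), 2)   # second pair, as the text requires at least two
num_classes = 5
w = [[random.uniform(-1, 1) for _ in h] for _ in range(num_classes)]
b = [0.0] * num_classes
probs = dense_softmax(h, w, b)     # one probability per host process
```

With a 32-value input, the two convolution and pooling pairs reduce the length to 30, 15, 13 and finally 6 before the fully connected layer. A practical model would use many channels and trained weights, typically through a deep learning framework.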
Of course, the application identification model needs to be updated, and the update period depends on the circumstances. If GPU high-performance computing is used, the model can be updated by daily or weekly training, depending on actual resources and business needs; if a CPU cluster is used, the model can be updated by weekly or monthly training, depending on actual resources and business needs.
After the application identification model has been established, the embodiment of the present invention can use it to identify traffic data. Fig. 3 shows a method for identifying traffic data according to an embodiment of the present invention. Referring to Fig. 3, the method includes at least:
Step S302, receiving traffic data, where the traffic data carries the packet payload present when the network side node received the traffic data;
Step S304, converting the traffic data into data recognizable by the application identification model;
Step S306, inputting the recognizable data into the application identification model to obtain the probabilities that the identified data belongs to different host processes;
Step S308, identifying the host process corresponding to the traffic data according to the obtained probabilities.
Specifically, the maximum probability value can be chosen as the decision result for the traffic data, which determines the name of the host process corresponding to the traffic data.
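Choosing the process with the maximum probability can be sketched as below. The probability values are invented for illustration and the function name is hypothetical; the process order follows Table three.

```python
def pick_process(probabilities, process_names):
    """Return the process name with the highest probability, and that probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return process_names[best], probabilities[best]

names = ["S.exe", "A.exe", "B.exe", "C.exe", "D.exe"]
probs = [0.05, 0.10, 0.15, 0.60, 0.10]   # made-up model output
name, probability = pick_process(probs, names)
```

For these values the decision result is C.exe.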
As mentioned above, when the application identification model is established, the traffic data must be converted into machine language for ease of reading and use. Likewise, when the application identification model is used to identify traffic data, step S304 must also be executed to convert the traffic data into data recognizable by the application identification model. Specifically, the packet payload of the traffic data is converted into machine language, that is, into data the application identification model can recognize. First, the packet payload, given as a hexadecimal string, is converted into the corresponding decimal numbers. Next, each converted decimal number is divided by 255, which yields L floating-point numbers in [0,1], where L is the length of the packet payload.
The embodiment of the present invention provides a specific example of traffic data identification. The data to be identified is converted and then identified with the application identification model; the output layer outputs, for each piece of data to be identified, the probability that it belongs to each class of application, as shown in Table five:
Table five
Finally, the application name with the maximum probability value is taken as the decision result for the data; in the example, C.exe is taken as the identification result of the data.
To explain the establishment of the application identification model and the subsequent traffic data identification process more clearly, the embodiment of the present invention provides a complete example, described below.
1, Data acquisition
In the training stage, the data is divided into two parts: host data and traffic data.
Host data is obtained from the host side and includes Time (time), SIP (source IP address), SPort (source port number), DIP (destination IP address), DPort (destination port number) and Process (the name of the process of the running application, such as "svchost.exe"), forming a six-tuple. See Table six:
Table six
Traffic data is obtained from the network side and includes Time (time), SIP (source IP address), SPort (source port number), DIP (destination IP address), DPort (destination port number) and Payload (the payload, that is, the spliced upstream and downstream packets of a TCP network flow, such as "705ba387fe ..."), forming a six-tuple. See Table seven:
Table seven
In the identification stage, the input data consists of traffic data only, in the same form as in the training stage.
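The two six-tuples can be modelled as small record types. The sketch below is illustrative only: the class names, the helper functions, the use of seconds for Time and all sample values are assumptions. It also shows one way to group traffic records by the address and port fields that both record types share, so that candidates for a host record can be looked up directly.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class HostRecord:
    time: float      # seconds; the time unit is an assumption
    sip: str
    sport: int
    dip: str
    dport: int
    process: str     # process name, e.g. "svchost.exe"

@dataclass(frozen=True)
class FlowRecord:
    time: float
    sip: str
    sport: int
    dip: str
    dport: int
    payload: str     # hexadecimal string

def four_tuple(record):
    """Address and port fields shared by host records and traffic records."""
    return (record.sip, record.sport, record.dip, record.dport)

def index_by_four_tuple(flow_records):
    """Group traffic records by four-tuple for direct lookup."""
    index = defaultdict(list)
    for flow in flow_records:
        index[four_tuple(flow)].append(flow)
    return index

# Made-up records for illustration.
host = HostRecord(0.0, "10.0.0.5", 50001, "192.0.2.10", 80, "svchost.exe")
flows = [FlowRecord(0.4, "10.0.0.5", 50001, "192.0.2.10", 80, "705ba387fe"),
         FlowRecord(0.9, "10.0.0.7", 50002, "192.0.2.10", 80, "474554202f")]
candidates = index_by_four_tuple(flows)[four_tuple(host)]
```

Frozen dataclasses keep the records immutable, which suits captured data that should not change after collection.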
2, Construction of training data (association of host data and traffic data)
Among the fields of the two tables above, both kinds of data have Time, SIP, SPort, DIP and DPort. An exact match is first made on SIP, SPort, DIP and DPort. Because the data sources differ and the times at which the systems record or upload the data differ, Time carries a certain delay, so a further association on Time is needed.
Taking the two schematic tables above as an example, the cases that need special handling when associating data are:
(1) After exact matching on the four-tuple, 1 host record is tentatively associated with 2 traffic records, but their times are close, so the time cannot determine which Payload corresponds to the application whose Process is A.exe. This case is therefore not associated and is not added to the training data.
(2) The four-tuples in the two kinds of data correspond one to one and the time interval is small, so the association is considered correct, and "474554202f ..." and "S.exe" are added to the training data.
(3) Although the four-tuples correspond one to one, the times differ by 31 minutes, which is too large an interval, so the data is considered unrelated and is not associated. Here, the time interval threshold depends on the actual data in each local area network; its value usually does not exceed 10 minutes.
The training data that is finally associated is shown in Table one:
Table one
3, Data conversion
For the Payload: the hexadecimal string is converted into decimal numbers in the range 0-255, and each number is then divided by 255. Each sample yields L floating-point numbers in [0,1], where L is the length of the payload.
An example is shown in Table two:
Table two
For data to be identified, this step alone is enough; for training data, the following processing is also needed:
For the application name in Process: the names are mapped to an ordered table whose indices start at 0 and increase one by one. An example is shown in Table three:
Table three
Index | Process |
0 | S.exe |
1 | A.exe |
2 | B.exe |
3 | C.exe |
4 | D.exe |
…… | …… |
For ease of retrieval, each process in Table three is further replaced by its numerical index, which produces Table four:
Table four
Finally, the training data is converted into a series of associated Xi and Yi pairs.
An example is shown in Table eight:
Table eight
5, Identification process
Input: the converted data to be identified, using the same conversion method as for the training data.
Using the trained CNN model parameters, forward operations such as convolution, pooling and activation are performed, and the output layer finally outputs the probability that each piece of data to be identified belongs to each class of application, as in Table five:
Table five
Finally, the application name with the maximum probability value is taken as the decision result for the data; in the example, C.exe is taken as the identification result of the data.
6, Post-processing of identification results
After an application has been identified, it can be compared against a known application library to determine whether it is on a restricted list, so that corresponding measures can be taken. Statistics can also be kept on the applications used by each host in the local area network, to understand how the number of applications is distributed across the local area network. For a given application, a corresponding handling strategy is taken according to the probability value.
The comparison cases and corresponding handling methods are shown in Table nine:
Table nine
The threshold for whether the probability of being identified as a given application is high can be set manually from experience; for example, a probability greater than 0.3 is considered high and the identification result reliable, while a probability less than 0.3 is considered low and the identification result unreliable.
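The post-processing step can be sketched as follows. The action labels, the function name and the sample results are hypothetical and do not come from Table nine; treating a probability of exactly 0.3 as unreliable is also an assumption, since the text leaves that case open.

```python
from collections import Counter

def post_process(process, probability, restricted, threshold=0.3):
    """Return a hypothetical action label for one identification result."""
    if probability <= threshold:
        return "unreliable"          # low probability: do not act on the result
    return "restricted" if process in restricted else "allowed"

# Made-up identification results: (host, identified process, probability).
results = [("host1", "C.exe", 0.62), ("host2", "B.exe", 0.21), ("host3", "S.exe", 0.55)]
restricted = {"C.exe"}               # hypothetical restricted list
actions = [post_process(proc, prob, restricted) for _, proc, prob in results]
usage = Counter(proc for _, proc, prob in results if prob > 0.3)
```

The counter gives a simple per-application tally over the reliable results, which corresponds to the statistics on application distribution mentioned above.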
Based on the same inventive concept, the embodiment of the present invention further provides a device for establishing an application identification model based on deep learning, applied to an environment in which a host and a network side node transmit data, where at least one host process with data processing capability is provided on the host. Fig. 4 shows a schematic structural diagram of the device for establishing an application identification model based on deep learning according to an embodiment of the present invention. Referring to Fig. 4, the device includes at least:
a first acquisition module 410, adapted to acquire multiple pieces of host data sent by the host, where each piece of host data carries the name of the host process that handles that host data on the host;
a second acquisition module 420, adapted to acquire multiple pieces of traffic data received by the network side node, where each piece of traffic data carries the packet payload present when the network side node received that traffic data;
a comparison module 430, coupled to the first acquisition module 410 and the second acquisition module 420 respectively, adapted to compare each piece of host data with each piece of traffic data to find at least one pair of associated host data and traffic data;
a third acquisition module 440, coupled to the comparison module 430, adapted to process the parameters of each associated pair of host data and traffic data to obtain, for each associated pair, the correspondence between the host process name and the packet payload;
an establishment module 450, coupled to the third acquisition module 440, adapted to establish the application identification model using the correspondence between each pair of host process name and packet payload.
In a preferred embodiment, the comparison module 430 is further adapted to:
compare the parameters carried by each piece of host data with those carried by each piece of traffic data;
find the associated pairs of host data and traffic data according to the comparison rule that several groups of parameters are identical or that the proportion of identical parameters exceeds a proportion threshold.
In a preferred embodiment,
the parameters carried by host data include at least: the sending time of the host data, source IP address, source port number, destination IP address, destination port number, and the name of the process that handles the host data;
the parameters carried by traffic data include at least: the receiving time of the traffic data, source IP address, source port number, destination IP address, destination port number, and the packet payload of the traffic data.
In a preferred embodiment, the comparison module 430 is further adapted to:
after the associated pairs of host data and traffic data have been found according to the comparison rule that several groups of parameters are identical or that the proportion of identical parameters exceeds a proportion threshold, screen the pairs determined to be associated according to a screening rule to pick out the pairs with a pseudo-association;
delete the pairs of host data and traffic data with a pseudo-association.
In a preferred embodiment, the comparison module 430 is further adapted to:
determine that an association is a pseudo-association if one piece of host data is associated with two or more pieces of traffic data;
or
determine that an association is a pseudo-association if one piece of host data is associated with one piece of traffic data but the time difference between the two exceeds a time difference threshold.
In a preferred embodiment, the establishment module 450 is further adapted to:
convert the host process names and the packet payloads into machine language respectively, producing machine-recognizable data;
establish the correspondence between the converted host process names and packet payloads, and establish the application identification model using that correspondence.
In a preferred embodiment, the establishment module 450 is further adapted to:
map the host process names to an ordered table whose indices start at 0 and increase one by one, converting each host process name into a corresponding natural number.
In a preferred embodiment, the establishment module 450 is further adapted to:
convert the packet payload, given as a hexadecimal string, into the corresponding decimal numbers;
divide each converted decimal number by 255 to obtain L floating-point numbers in [0,1], where L is the length of the packet payload.
Based on the same inventive concept, the embodiment of the present invention further provides a device for identifying traffic data. Fig. 5 shows a schematic structural diagram of the device for identifying traffic data according to an embodiment of the present invention. Referring to Fig. 5, the device includes at least:
a receiving module 510, adapted to receive traffic data, where the traffic data carries the packet payload present when the network side node received the traffic data;
a conversion module 520, coupled to the receiving module 510, adapted to convert the traffic data into data recognizable by the application identification model;
an input module 530, coupled to the conversion module 520, adapted to input the recognizable data into the application identification model to obtain the probabilities that the identified data belongs to different host processes;
an identification module 540, coupled to the input module 530, adapted to identify the host process corresponding to the traffic data according to the probabilities obtained by the input module 530.
In a preferred embodiment, the conversion module 520 is further adapted to:
convert the packet payload of the traffic data into machine language, producing data recognizable by the application identification model.
In a preferred embodiment, the conversion module 520 is further adapted to:
convert the packet payload, given as a hexadecimal string, into the corresponding decimal numbers;
divide each converted decimal number by 255 to obtain L floating-point numbers in [0,1], where L is the length of the packet payload.
In a preferred embodiment, the identification module 540 is further adapted to:
choose the maximum probability value as the decision result for the traffic data, and determine the process name corresponding to the traffic data.
In summary, the embodiment of the present invention provides a system diagram of the establishment of the application identification model and the subsequent traffic data identification process, shown in Fig. 6. Host data and traffic data enter the training data association module, then the training data conversion module and deep learning model training, which establishes the application identification module. When new traffic data is input, it is converted by the identification data conversion module and then input into the application identification module to obtain the identification result. The identification result is then compared with the application library and handled accordingly.
The establishment of the application identification model and the subsequent traffic data identification process provided by the embodiment of the present invention can achieve the following beneficial effects:
In the embodiment of the present invention, host data and traffic data are compared to find the points of association between them, and the associated host data and traffic data are then selected accordingly. On the host side, host data is sent by a specific process; on the network side, the packet payload of traffic data can be determined when the traffic data is obtained. Through the association between host data and traffic data, the embodiment of the present invention determines the actual correspondence between host process names and packet payloads and generates the application identification model from that correspondence. Later, when the application identification model is used, the name of the process that sent the host data can be found from the packet payload of the traffic data, which in turn determines the application that sent the host data. Identifying the application makes it possible to determine the application's priority and hence the handling priority of the traffic data, which helps to optimize and configure the applications and the network within an enterprise and to ensure fast transfer of information and efficient work. That is, the embodiment of the present invention can establish an application identification model that identifies the application that initiated a piece of traffic data: once the traffic data is input into the application identification model established by the embodiment of the present invention, it can quickly be learned which application on the host sent the traffic data, without manual involvement and without adding identification software on the host side, which greatly improves the accuracy, convenience and efficiency of application identification.
In the description provided here, numerous specific details are set forth. It is to be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail, so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components in the embodiments can be combined into one module or unit or component, and they can furthermore be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed can be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) can be replaced by an alternative feature serving the same, equivalent or similar purpose.
In addition, those skilled in the art will appreciate that, although some embodiments described herein include certain features included in other embodiments and not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the claims, any one of the claimed embodiments can be used in any combination.
The various component embodiments of the invention can be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) can be used in practice to implement some or all of the functions of some or all of the components of the device for establishing an application identification model based on deep learning and of the device for identifying traffic data according to the embodiments of the present invention. The invention can also be implemented as equipment or device programs (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the invention can be stored on a computer-readable medium, or can take the form of one or more signals. Such signals can be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order. These words can be interpreted as names.
Thus, those skilled in the art will recognize that, although multiple exemplary embodiments of the invention have been shown and described in detail herein, many other variations or modifications consistent with the principles of the invention can be directly determined or derived from the present disclosure without departing from the spirit and scope of the invention. Therefore, the scope of the invention should be understood and deemed to cover all such other variations or modifications.
Based on one aspect of the present invention, it is disclosed that A1, a kind of application identification model foundation side based on deep learning
Method is provided at least one on the host and has data applied to the environment that host and network side node carry out data transmission
The host processes of processing capacity, comprising:
Obtain a plurality of host data of the host transmission, wherein carried in the host in each host data to this
The host processes title that host data is handled;
Obtain the received a plurality of data on flows of the network side node, wherein the network is carried in each data on flows
Side gusset receives the data pack load when data on flows;
Each host data is compared with each data on flows, to find out at least a pair of of the host for wherein having relevance
Data and data on flows;
Each parameter to the host data and data on flows that have relevance is handled, with obtain it is each to have association
Corresponding relationship corresponding to the host data and data on flows of property between host processes title and data pack load;
It is established using the corresponding relationship of each pair of host processes title and data pack load described using identification model.
A2, method according to a1, wherein each host data is compared with each data on flows, to find out it
In have at least a pair of of the host data and data on flows of relevance, comprising:
Each host data is compared with each parameter that each data on flows carries;
It is identical according to multiple groups parameter, alternatively, identical parameters ratio is more than the comparison rules of proportion threshold value, have to find out
The host data and data on flows pair of relevance.
A3, the method according to A2, wherein
The parameter that host data carries includes at least: transmission time, source IP address, source port number, the target of host data
IP address, destination port number, the process title for handling host data;
The parameter that data on flows carries includes at least: receiving time, source IP address, source port number, the target of data on flows
IP address, destination port number, the data pack load of data on flows.
A4. The method according to A2, wherein, after finding the associated host data and traffic data pairs according to the comparison rule under which multiple groups of parameters are identical, or the proportion of identical parameters exceeds the proportion threshold, the method further comprises:
screening the associated host data and traffic data pairs according to a screening rule to identify the pairs whose association is false;
deleting the host data and traffic data pairs whose association is false.
A5. The method according to A4, wherein screening the associated host data and traffic data pairs according to the screening rule to identify the pairs whose association is false comprises at least one of the following:
if one host data item is associated with two or more traffic data items, determining that the association is false;
if one host data item is associated with one traffic data item, but the time difference between the two exceeds a time difference threshold, determining that the association is false.
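The two screening rules can be sketched as a single filter. This is an illustrative sketch under stated assumptions: the tuple layout `(host_id, traffic_id, sent_at, received_at)`, the function name and the `max_time_diff` default are hypothetical, and the sketch applies both rules together although the clause only requires at least one of them.

```python
from collections import Counter


def drop_false_associations(pairs, max_time_diff=1.0):
    """Remove pairs whose association is judged false.

    pairs: list of (host_id, traffic_id, sent_at, received_at) tuples
    (an assumed layout for this example).
    """
    # Count how many traffic records each host record was paired with.
    per_host = Counter(host_id for host_id, _, _, _ in pairs)
    kept = []
    for host_id, traffic_id, sent_at, received_at in pairs:
        # Rule 1: one host record associated with two or more traffic records.
        if per_host[host_id] >= 2:
            continue
        # Rule 2: time difference exceeds the (assumed) threshold.
        if abs(received_at - sent_at) > max_time_diff:
            continue
        kept.append((host_id, traffic_id, sent_at, received_at))
    return kept
```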
A6. The method according to any one of A1 to A5, wherein establishing the application identification model using the correspondence between each pair of host process name and packet payload comprises:
performing machine language conversion on the host process name and on the packet payload respectively, converting each into machine-recognizable machine data;
establishing the correspondence between the converted host process name and the converted packet payload, and establishing the application identification model using that correspondence.
A7. The method according to A6, wherein performing machine language conversion on the host process name, converting it into machine-recognizable machine data, comprises:
mapping the host process names onto an ordered list that starts at 0 and increases by one, so that each host process name is converted into a corresponding natural number.
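The mapping of process names onto an ordered list starting at 0 can be sketched as below. The function name and the first-seen ordering are assumptions for illustration; the text only requires that each name maps to a distinct natural number counted up from 0.

```python
def build_label_map(process_names):
    """Assign each distinct process name the next natural number,
    starting at 0, in order of first appearance (assumed ordering)."""
    label_map = {}
    for name in process_names:
        if name not in label_map:
            label_map[name] = len(label_map)
    return label_map
```

For instance, the names `chrome.exe`, `qq.exe`, `chrome.exe` (hypothetical process names) would yield the labels 0 and 1.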
A8. The method according to A6 or A7, wherein performing machine language conversion on the packet payload, converting it into machine-recognizable machine data, comprises:
converting the packet payload, which is a hexadecimal string, into the corresponding decimal numbers;
dividing the converted decimal numbers by 255 to obtain L floating-point numbers in the range [0, 1], where L is the length of the packet payload.
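A minimal sketch of this payload conversion follows. It assumes that the conversion is byte-wise, that is, every two hexadecimal digits form one decimal value between 0 and 255, so that L is the payload length in bytes; the text does not spell this out, and the function name is hypothetical.

```python
def payload_to_floats(payload_hex):
    """Convert a hexadecimal payload string into floats in [0, 1].

    Assumption: the payload is decoded byte by byte, so each pair of
    hex digits becomes one value in 0..255 before dividing by 255.
    """
    raw = bytes.fromhex(payload_hex)
    return [byte / 255 for byte in raw]
```

Dividing by 255 scales each byte into [0, 1], a common way to normalize inputs before feeding them to a neural network.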
A9. The method according to any one of A1 to A8, wherein the application identification model is used as follows:
obtaining the input data of the application identification model and processing it through a convolutional layer and a pooling layer to generate deep features of the input data;
sending the deep features to a fully connected layer, the same as that of a neural network, which parses the deep features;
the fully connected layer passing the parsing result of the deep features to an output layer, which outputs the result.
A10. The method according to A9, wherein the convolutional layer and the pooling layer are stacked in multiple layers, and the more layers are stacked, the deeper the deep features.
A11. The method according to A9 or A10, wherein the convolutional layer and the pooling layer are used in pairs.
A12. The method according to any one of A9 to A11, wherein the window dimension of the convolutional layer and the pooling layer is 1*n.
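The forward pass through paired convolutional and pooling layers with 1*n windows, followed by a fully connected layer and an output layer, can be sketched in plain Python. This is a single-channel toy under several assumptions not stated in the text: a ReLU activation after each convolution, max pooling with non-overlapping windows, a softmax output layer, and externally supplied (untrained) weights. All function names are hypothetical.

```python
import math


def conv1d(signal, kernel):
    """Slide a 1*n kernel over the signal; ReLU activation is an assumption."""
    n = len(kernel)
    return [
        max(0.0, sum(signal[i + j] * kernel[j] for j in range(n)))
        for i in range(len(signal) - n + 1)
    ]


def max_pool1d(signal, n):
    """Max pooling over non-overlapping 1*n windows (assumed pooling type)."""
    return [max(signal[i:i + n]) for i in range(0, len(signal) - n + 1, n)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(inputs, kernels, pool_n, fc_weights, fc_biases):
    """Apply (convolution, pooling) pairs in sequence, then one fully
    connected layer, then a softmax output layer."""
    features = inputs
    for kernel in kernels:  # one kernel per stacked conv/pool pair
        features = max_pool1d(conv1d(features, kernel), pool_n)
    scores = [
        sum(w * f for w, f in zip(row, features)) + bias
        for row, bias in zip(fc_weights, fc_biases)
    ]
    return softmax(scores)
```

Each row of `fc_weights` corresponds to one candidate host process, so the output has one probability per process. A practical model would use many channels and trained weights.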
According to another aspect of the present invention, there is also disclosed B13. A method for identifying traffic data, comprising:
receiving traffic data, wherein the traffic data carries the packet payload present when a network-side node received the traffic data;
converting the traffic data into data recognizable by the application identification model;
inputting the recognizable data into the application identification model to obtain the probabilities that the data being identified belongs to different host processes;
identifying the host process corresponding to the traffic data according to the obtained probabilities.
B14. The method according to B13, wherein converting the traffic data into data recognizable by the application identification model comprises:
performing machine language conversion on the packet payload of the traffic data, converting it into data recognizable by the application identification model.
B15. The method according to B14, wherein performing machine language conversion on the packet payload of the traffic data, converting it into data recognizable by the application identification model, comprises:
converting the packet payload, which is a hexadecimal string, into the corresponding decimal numbers;
dividing the converted decimal numbers by 255 to obtain L floating-point numbers in the range [0, 1], where L is the length of the packet payload.
B16. The method according to any one of B13 to B15, wherein identifying the host process corresponding to the traffic data according to the obtained probabilities comprises:
selecting the maximum probability value as the determination result for the traffic data, thereby determining the host process name corresponding to the traffic data.
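Selecting the maximum probability and mapping it back to a process name can be sketched as below. The function name, the dictionary layout (process name to natural-number label) and the example process names are assumptions for illustration.

```python
def identify_process(probabilities, label_map):
    """Return the process name whose label has the highest probability.

    label_map: dict from process name to natural-number label
    (an assumed layout matching a 0-based ordered list of names).
    """
    index_to_name = {index: name for name, index in label_map.items()}
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return index_to_name[best_index]
```

If two processes tie for the highest probability, this sketch returns the one with the smaller label; the text does not say how ties should be resolved.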
According to another aspect of the present invention, there is also disclosed C17. An apparatus for establishing an application identification model based on deep learning, applied to an environment in which a host and a network-side node transmit data, the host being provided with at least one host process capable of processing data, the apparatus comprising:
a first obtaining module, adapted to obtain multiple host data items transmitted by the host, wherein each host data item carries the name of the host process that handles that host data item in the host;
a second obtaining module, adapted to obtain multiple traffic data items received by the network-side node, wherein each traffic data item carries the packet payload present when the network-side node received that traffic data item;
a comparison module, adapted to compare each host data item with each traffic data item to find at least one pair of host data and traffic data that are associated;
a third obtaining module, adapted to process the parameters of each associated pair of host data and traffic data to obtain the correspondence between the host process name and the packet payload for each associated pair;
an establishing module, adapted to establish the application identification model using the correspondence between each pair of host process name and packet payload.
C18. The apparatus according to C17, wherein the comparison module is further adapted to:
compare each host data item with each traffic data item on every parameter they carry;
find the associated host data and traffic data pairs according to a comparison rule under which multiple groups of parameters are identical, or the proportion of identical parameters exceeds a proportion threshold.
C19. The apparatus according to C18, wherein:
the parameters carried by the host data include at least: the sending time of the host data, a source IP address, a source port number, a destination IP address, a destination port number, and the name of the process that handles the host data;
the parameters carried by the traffic data include at least: the receiving time of the traffic data, a source IP address, a source port number, a destination IP address, a destination port number, and the packet payload of the traffic data.
C20. The apparatus according to C18, wherein the comparison module is further adapted to:
after finding the associated host data and traffic data pairs according to the comparison rule under which multiple groups of parameters are identical, or the proportion of identical parameters exceeds the proportion threshold, screen the associated host data and traffic data pairs according to a screening rule to identify the pairs whose association is false;
delete the host data and traffic data pairs whose association is false.
C21. The apparatus according to C20, wherein the comparison module is further adapted to:
if one host data item is associated with two or more traffic data items, determine that the association is false;
or
if one host data item is associated with one traffic data item, but the time difference between the two exceeds a time difference threshold, determine that the association is false.
C22. The apparatus according to any one of C17 to C21, wherein the establishing module is further adapted to:
perform machine language conversion on the host process name and on the packet payload respectively, converting each into machine-recognizable machine data;
establish the correspondence between the converted host process name and the converted packet payload, and establish the application identification model using that correspondence.
C23. The apparatus according to C22, wherein the establishing module is further adapted to:
map the host process names onto an ordered list that starts at 0 and increases by one, so that each host process name is converted into a corresponding natural number.
C24. The apparatus according to C22 or C23, wherein the establishing module is further adapted to:
convert the packet payload, which is a hexadecimal string, into the corresponding decimal numbers;
divide the converted decimal numbers by 255 to obtain L floating-point numbers in the range [0, 1], where L is the length of the packet payload.
According to another aspect of the present invention, there is also disclosed D25. An apparatus for identifying traffic data, comprising:
a receiving module, adapted to receive traffic data, wherein the traffic data carries the packet payload present when a network-side node received the traffic data;
a conversion module, adapted to convert the traffic data into data recognizable by the application identification model;
an input module, adapted to input the recognizable data into the application identification model to obtain the probabilities that the data being identified belongs to different host processes;
an identification module, adapted to identify the host process corresponding to the traffic data according to the probabilities obtained by the input module.
D26. The apparatus according to D25, wherein the conversion module is further adapted to:
perform machine language conversion on the packet payload of the traffic data, converting it into data recognizable by the application identification model.
D27. The apparatus according to D26, wherein the conversion module is further adapted to:
convert the packet payload, which is a hexadecimal string, into the corresponding decimal numbers;
divide the converted decimal numbers by 255 to obtain L floating-point numbers in the range [0, 1], where L is the length of the packet payload.
D28. The apparatus according to any one of D25 to D27, wherein the identification module is further adapted to:
select the maximum probability value as the determination result for the traffic data, thereby determining the process name corresponding to the traffic data.
Claims (27)
1. A method for establishing an application identification model based on deep learning, applied to an environment in which a host and a network-side node transmit data, the host being provided with at least one host process capable of processing data, the method comprising:
obtaining multiple host data items transmitted by the host, wherein each host data item carries the name of the host process that handles that host data item in the host;
obtaining multiple traffic data items received by the network-side node, wherein each traffic data item carries the packet payload present when the network-side node received that traffic data item;
comparing each host data item with each traffic data item to find at least one pair of host data and traffic data that are associated;
processing the parameters of each associated pair of host data and traffic data to obtain the correspondence between the host process name and the packet payload for each associated pair;
establishing the application identification model using the correspondence between each pair of host process name and packet payload;
wherein the application identification model is used as follows:
obtaining the input data of the application identification model and processing it through a convolutional layer and a pooling layer to generate deep features of the input data;
sending the deep features to a fully connected layer, the same as that of a neural network, which parses the deep features;
the fully connected layer passing the parsing result of the deep features to an output layer, which outputs the result.
2. The method according to claim 1, wherein comparing each host data item with each traffic data item to find at least one pair of host data and traffic data that are associated comprises:
comparing each host data item with each traffic data item on every parameter they carry;
finding the associated host data and traffic data pairs according to a comparison rule under which multiple groups of parameters are identical, or the proportion of identical parameters exceeds a proportion threshold.
3. The method according to claim 2, wherein:
the parameters carried by the host data include at least: the sending time of the host data, a source IP address, a source port number, a destination IP address, a destination port number, and the name of the process that handles the host data;
the parameters carried by the traffic data include at least: the receiving time of the traffic data, a source IP address, a source port number, a destination IP address, a destination port number, and the packet payload of the traffic data.
4. The method according to claim 2, wherein, after finding the associated host data and traffic data pairs according to the comparison rule under which multiple groups of parameters are identical, or the proportion of identical parameters exceeds the proportion threshold, the method further comprises:
screening the associated host data and traffic data pairs according to a screening rule to identify the pairs whose association is false;
deleting the host data and traffic data pairs whose association is false.
5. The method according to claim 4, wherein screening the associated host data and traffic data pairs according to the screening rule to identify the pairs whose association is false comprises at least one of the following:
if one host data item is associated with two or more traffic data items, determining that the association is false;
if one host data item is associated with one traffic data item, but the time difference between the two exceeds a time difference threshold, determining that the association is false.
6. The method according to any one of claims 1 to 5, wherein establishing the application identification model using the correspondence between each pair of host process name and packet payload comprises:
performing machine language conversion on the host process name and on the packet payload respectively, converting each into machine-recognizable machine data;
establishing the correspondence between the converted host process name and the converted packet payload, and establishing the application identification model using that correspondence.
7. The method according to claim 6, wherein performing machine language conversion on the host process name, converting it into machine-recognizable machine data, comprises:
mapping the host process names onto an ordered list that starts at 0 and increases by one, so that each host process name is converted into a corresponding natural number.
8. The method according to claim 6, wherein performing machine language conversion on the packet payload, converting it into machine-recognizable machine data, comprises:
converting the packet payload, which is a hexadecimal string, into the corresponding decimal numbers;
dividing the converted decimal numbers by 255 to obtain L floating-point numbers in the range [0, 1], where L is the length of the packet payload.
9. The method according to claim 1, wherein the convolutional layer and the pooling layer are stacked in multiple layers, and the more layers are stacked, the deeper the deep features.
10. The method according to any one of claims 1 to 5, wherein the convolutional layer and the pooling layer are used in pairs.
11. The method according to any one of claims 1 to 5, wherein the window dimension of the convolutional layer and the pooling layer is 1*n, where n is the size of the basic unit of convolution or pooling.
12. A method for identifying traffic data, using an application identification model established by the deep-learning-based application identification model establishing method according to any one of claims 1 to 11, the identification method comprising:
receiving traffic data, wherein the traffic data carries the packet payload present when a network-side node received the traffic data;
converting the traffic data into data recognizable by the application identification model;
inputting the recognizable data into the application identification model to obtain the probabilities that the data being identified belongs to different host processes;
identifying the host process corresponding to the traffic data according to the obtained probabilities.
13. The method according to claim 12, wherein converting the traffic data into data recognizable by the application identification model comprises:
performing machine language conversion on the packet payload of the traffic data, converting it into data recognizable by the application identification model.
14. The method according to claim 13, wherein performing machine language conversion on the packet payload of the traffic data, converting it into data recognizable by the application identification model, comprises:
converting the packet payload, which is a hexadecimal string, into the corresponding decimal numbers;
dividing the converted decimal numbers by 255 to obtain L floating-point numbers in the range [0, 1], where L is the length of the packet payload.
15. The method according to any one of claims 12 to 14, wherein identifying the host process corresponding to the traffic data according to the obtained probabilities comprises:
selecting the maximum probability value as the determination result for the traffic data, thereby determining the host process name corresponding to the traffic data.
16. An apparatus for establishing an application identification model based on deep learning, applied to an environment in which a host and a network-side node transmit data, the host being provided with at least one host process capable of processing data, the apparatus comprising:
a first obtaining module, adapted to obtain multiple host data items transmitted by the host, wherein each host data item carries the name of the host process that handles that host data item in the host;
a second obtaining module, adapted to obtain multiple traffic data items received by the network-side node, wherein each traffic data item carries the packet payload present when the network-side node received that traffic data item;
a comparison module, adapted to compare each host data item with each traffic data item to find at least one pair of host data and traffic data that are associated;
a third obtaining module, adapted to process the parameters of each associated pair of host data and traffic data to obtain the correspondence between the host process name and the packet payload for each associated pair;
an establishing module, adapted to establish the application identification model using the correspondence between each pair of host process name and packet payload; wherein the application identification model is used as follows: obtaining the input data of the application identification model and processing it through a convolutional layer and a pooling layer to generate deep features of the input data; sending the deep features to a fully connected layer, the same as that of a neural network, which parses the deep features; the fully connected layer passing the parsing result of the deep features to an output layer, which outputs the result.
17. The apparatus according to claim 16, wherein the comparison module is further adapted to:
compare each host data item with each traffic data item on every parameter they carry;
find the associated host data and traffic data pairs according to a comparison rule under which multiple groups of parameters are identical, or the proportion of identical parameters exceeds a proportion threshold.
18. The apparatus according to claim 17, wherein:
the parameters carried by the host data include at least: the sending time of the host data, a source IP address, a source port number, a destination IP address, a destination port number, and the name of the process that handles the host data;
the parameters carried by the traffic data include at least: the receiving time of the traffic data, a source IP address, a source port number, a destination IP address, a destination port number, and the packet payload of the traffic data.
19. The apparatus according to claim 18, wherein the comparison module is further adapted to:
after finding the associated host data and traffic data pairs according to the comparison rule under which multiple groups of parameters are identical, or the proportion of identical parameters exceeds the proportion threshold, screen the associated host data and traffic data pairs according to a screening rule to identify the pairs whose association is false;
delete the host data and traffic data pairs whose association is false.
20. The apparatus according to claim 19, wherein the comparison module is further adapted to:
if one host data item is associated with two or more traffic data items, determine that the association is false; or
if one host data item is associated with one traffic data item, but the time difference between the two exceeds a time difference threshold, determine that the association is false.
21. The apparatus according to any one of claims 16 to 20, wherein the establishing module is further adapted to:
perform machine language conversion on the host process name and on the packet payload respectively, converting each into machine-recognizable machine data;
establish the correspondence between the converted host process name and the converted packet payload, and establish the application identification model using that correspondence.
22. The apparatus according to claim 21, wherein the establishing module is further adapted to:
map the host process names onto an ordered list that starts at 0 and increases by one, so that each host process name is converted into a corresponding natural number.
23. The apparatus according to claim 21, wherein the establishing module is further adapted to:
convert the packet payload, which is a hexadecimal string, into the corresponding decimal numbers;
divide the converted decimal numbers by 255 to obtain L floating-point numbers in the range [0, 1], where L is the length of the packet payload.
24. An apparatus for identifying traffic data, using an application identification model established by the deep-learning-based application identification model establishing apparatus according to any one of claims 16 to 23, the identification apparatus comprising:
a receiving module, adapted to receive traffic data, wherein the traffic data carries the packet payload present when a network-side node received the traffic data;
a conversion module, adapted to convert the traffic data into data recognizable by the application identification model;
an input module, adapted to input the recognizable data into the application identification model to obtain the probabilities that the data being identified belongs to different host processes;
an identification module, adapted to identify the host process corresponding to the traffic data according to the probabilities obtained by the input module.
25. The apparatus according to claim 24, wherein the conversion module is further adapted to:
perform machine language conversion on the packet payload of the traffic data, converting it into data recognizable by the application identification model.
26. The apparatus according to claim 25, wherein the conversion module is further adapted to:
convert the packet payload, which is a hexadecimal string, into the corresponding decimal numbers;
divide the converted decimal numbers by 255 to obtain L floating-point numbers in the range [0, 1], where L is the length of the packet payload.
27. The apparatus according to any one of claims 24 to 26, wherein the identification module is further adapted to:
select the maximum probability value as the determination result for the traffic data, thereby determining the process name corresponding to the traffic data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610018242.2A CN105516027B (en) | 2016-01-12 | 2016-01-12 | Using identification model method for building up, the recognition methods of data on flows and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105516027A CN105516027A (en) | 2016-04-20 |
CN105516027B true CN105516027B (en) | 2019-03-12 |
Family
ID=55723677
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610018242.2A Active CN105516027B (en) | 2016-01-12 | 2016-01-12 | Using identification model method for building up, the recognition methods of data on flows and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105516027B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105812188A (en) * | 2016-04-25 | 2016-07-27 | 北京网康科技有限公司 | Traffic recognition method and device |
CN106130839B (en) * | 2016-07-12 | 2019-03-01 | 电子科技大学 | A kind of business recognition method applied to broadband access network |
CN106790019B (en) * | 2016-12-14 | 2019-10-11 | 北京天融信网络安全技术有限公司 | Encryption method for recognizing flux and device based on feature self study |
CN108924090B (en) * | 2018-06-04 | 2020-12-11 | 上海交通大学 | Method for detecting traffics of shadowsocks based on convolutional neural network |
CN110784330B (en) * | 2018-07-30 | 2022-04-05 | 华为技术有限公司 | Method and device for generating application recognition model |
CN109361617B (en) * | 2018-09-26 | 2022-09-27 | 中国科学院计算机网络信息中心 | Convolutional neural network traffic classification method and system based on network packet load |
CN109802868B (en) * | 2019-01-10 | 2022-05-06 | 中山大学 | Mobile application real-time identification method based on cloud computing |
CN113326946A (en) * | 2020-02-29 | 2021-08-31 | 华为技术有限公司 | Method, device and storage medium for updating application recognition model |
CN114499941B (en) * | 2021-12-22 | 2023-08-04 | 天翼云科技有限公司 | Training and detecting method of flow detection model and electronic equipment |
CN116204386B (en) * | 2023-04-26 | 2023-07-28 | 北京明易达科技股份有限公司 | Method, system, medium and equipment for automatically identifying and monitoring application service relationship |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101741908A (en) * | 2009-12-25 | 2010-06-16 | 青岛朗讯科技通讯设备有限公司 | Identification method for application layer protocol characteristic |
CN101764748A (en) * | 2009-12-16 | 2010-06-30 | 福建星网锐捷网络有限公司 | Method for identifying application program, device and system thereof |
CN105100091A (en) * | 2015-07-13 | 2015-11-25 | 北京奇虎科技有限公司 | Protocol identification method and protocol identification system |
Also Published As
Publication number | Publication date |
---|---|
CN105516027A (en) | 2016-04-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105516027B (en) | Using identification model method for building up, the recognition methods of data on flows and device | |
CN104391881B (en) | A kind of daily record analytic method and system based on segmentation methods | |
CN103870381B (en) | A kind of test data generating method and device | |
US8068431B2 (en) | System and method for deep packet inspection | |
CN108124487A (en) | cloud meter reading method and device | |
US8037057B2 (en) | Multi-column statistics usage within index selection tools | |
CN105100091A (en) | Protocol identification method and protocol identification system | |
CN101453424B (en) | Network information resource access control method and system | |
US20180268081A1 (en) | Data extraction | |
CN110765639A (en) | Electrical simulation modeling method and device and readable storage medium | |
CN105868311A (en) | Data analyzing method and device | |
US11914641B2 (en) | Text to color palette generator | |
CN113760730B (en) | Automatic test method and device | |
CN109240903A (en) | A kind of method and apparatus assessed automatically | |
CN104021147B (en) | A kind of code stream analyzing method and device | |
CN112395371B (en) | Financial institution asset classification processing method, device and readable medium | |
CN117675406A (en) | Heterogeneous task flow intelligent analysis method based on power law segmentation length sequence | |
CN109063040A (en) | Client-side program collecting method and system | |
CN110309214A (en) | A kind of instruction executing method and its equipment, storage medium, server | |
CN106293862B (en) | A kind of analysis method and device of expandable mark language XML data | |
CN110493058A (en) | The construction method and device of network topology structure, storage medium, terminal | |
CN103220274B (en) | A kind of network message pattern matching process for operator's network outlet and system | |
CN113835712B (en) | Fast data packet routing method for judging according to given field value | |
US9172595B2 (en) | Systems and methods of packet object database management | |
CN108270599A (en) | A kind of data analyzing and processing method and system based on snmp protocol |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP01 | Change in the name or title of a patent holder | ||
Address after: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park) Co-patentee after: QAX Technology Group Inc. Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd. Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park) Co-patentee before: BEIJING QIANXIN TECHNOLOGY Co.,Ltd. Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd. |