CN105516027B - Method for establishing an application identification model, and method and device for identifying traffic data - Google Patents

Method for establishing an application identification model, and method and device for identifying traffic data

Info

Publication number
CN105516027B
Authority
CN
China
Prior art keywords
data
host
flows
identification model
relevance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610018242.2A
Other languages
Chinese (zh)
Other versions
CN105516027A (en)
Inventor
王占
王占一
刘博�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qax Technology Group Inc
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Beijing Qianxin Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Beijing Qianxin Technology Co Ltd
Priority to CN201610018242.2A
Publication of CN105516027A
Application granted
Publication of CN105516027B
Legal status: Active
Anticipated expiration

Links

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00: Traffic control in data switching networks
    • H04L47/70: Admission control; Resource allocation
    • H04L47/80: Actions related to the user profile or the type of traffic
    • H04L47/803: Application aware

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Computer And Data Communications (AREA)

Abstract

The present invention provides a method for establishing an application identification model, and a method and device for identifying traffic data. The model-establishing method is applied to an environment in which a host and a network-side node transmit data, where at least one host process with data-processing capability runs on the host. The method comprises: obtaining multiple pieces of host data sent by the host; obtaining multiple pieces of traffic data received by the network-side node; comparing each piece of host data with each piece of traffic data to find at least one pair of host data and traffic data that are associated; processing the parameters of each associated pair to obtain the correspondence between the host process name and the packet payload of that pair; and establishing the application identification model from the correspondences between host process names and packet payloads. The invention can improve the accuracy, convenience, and efficiency of application identification.

Description

Method for establishing an application identification model, and method and device for identifying traffic data
Technical Field
The present invention relates to the field of computer technology, and in particular to a method and device for establishing a deep-learning-based application identification model, and to a method and device for identifying traffic data.
Background
In an intranet environment, different applications often have different priorities. Take management applications and P2P applications: most enterprises classify the former as important applications and the latter as restricted applications. Even the same application may have different priorities in different enterprises. For example, a video application has a different standing in a video company than in an e-commerce company. Understanding how each application is used also helps an enterprise optimize and configure its applications and network, so that information is transmitted quickly and work proceeds efficiently. Identifying applications within a local area network is therefore very important.
One existing approach is port matching: the source port or destination port used by a data flow is compared against the system's existing application-port database to determine the application that corresponds to the flow. Some applications transmit their data flows over specific ports, so the approach is feasible for those applications. Its advantages are that it requires neither storage and analysis of large amounts of data nor complex algorithms, so the system burden is very small. In practice, however, some ports correspond to several applications, and some applications do not use fixed ports, so identification accuracy is low.
Another approach is pattern matching, currently the most widely used. It falls into two classes. The first works on the host side and matches the file characteristics of an application, such as product version, product name, company, file version, and source file name, against a known feature library. It requires identification software on every host, which affects user experience and host performance. The second works on the network side and identifies applications by matching data flows against known feature rules. It requires features to be analyzed and defined manually; because new applications appear every day, the manual analysis workload is too large to keep pace with them.
Summary of the Invention
In view of the above problems, the present invention is proposed to provide a method and device for establishing a deep-learning-based application identification model, and a method and device for identifying traffic data, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, an embodiment provides a method for establishing a deep-learning-based application identification model. The method is applied to an environment in which a host and a network-side node transmit data, at least one host process with data-processing capability runs on the host, and the method comprises:
Obtaining multiple pieces of host data sent by the host, where each piece of host data carries the name of the host process that processes it on the host;
Obtaining multiple pieces of traffic data received by the network-side node, where each piece of traffic data carries the packet payload present when the network-side node received it;
Comparing each piece of host data with each piece of traffic data to find at least one pair of host data and traffic data that are associated;
Processing the parameters of each associated pair of host data and traffic data to obtain the correspondence between the host process name and the packet payload of each pair;
Establishing the application identification model from the correspondence between the host process name and the packet payload of each pair.
Optionally, comparing each piece of host data with each piece of traffic data to find at least one associated pair of host data and traffic data comprises:
Comparing the parameters carried by each piece of host data with the parameters carried by each piece of traffic data;
Finding the associated pairs of host data and traffic data according to a comparison rule under which multiple parameters are identical, or the ratio of identical parameters exceeds a ratio threshold.
Optionally, the parameters carried by host data include at least: the sending time of the host data, the source IP address, the source port number, the destination IP address, the destination port number, and the name of the process that handles the host data;
The parameters carried by traffic data include at least: the receiving time of the traffic data, the source IP address, the source port number, the destination IP address, the destination port number, and the packet payload of the traffic data.
Optionally, after the associated pairs of host data and traffic data have been found according to the comparison rule under which multiple parameters are identical, or the ratio of identical parameters exceeds the ratio threshold, the method further comprises:
Screening the pairs determined to be associated according to a screening rule, to pick out the pairs whose association is a pseudo-association;
Deleting the pairs of host data and traffic data that have a pseudo-association.
Optionally, screening the pairs determined to be associated according to the screening rule, to pick out the pairs whose association is a pseudo-association, includes at least one of the following:
If one piece of host data is associated with two or more pieces of traffic data, the association is determined to be a pseudo-association;
If one piece of host data is associated with one piece of traffic data but the time difference between the two exceeds a time-difference threshold, the association is determined to be a pseudo-association.
Optionally, establishing the application identification model from the correspondence between the host process name and the packet payload of each pair comprises:
Converting the host process name and the packet payload separately into machine-readable data;
Establishing the correspondence between the converted host process name and the converted packet payload, and establishing the application identification model from that correspondence.
Optionally, converting the host process name into machine-readable data comprises:
Mapping the host process names onto an ordered list that starts at 0 and increments by one, so that each host process name is converted into a corresponding natural number.
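The mapping described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_label_map`, the sample process names, and the choice to number names in order of first appearance are all assumptions made for the example.

```python
def build_label_map(process_names):
    """Map each distinct process name to a natural number, starting at 0.

    Numbering follows the order of first appearance (an assumption; the
    text only requires an ordered list that starts at 0 and increments by 1).
    """
    label_map = {}
    for name in process_names:
        if name not in label_map:
            label_map[name] = len(label_map)
    return label_map


# Hypothetical sample of process names taken from associated pairs.
names = ["S.exe", "S.exe", "A.exe", "B.exe", "C.exe", "B.exe", "D.exe"]
label_map = build_label_map(names)
labels = [label_map[n] for n in names]
```

The resulting integers can serve as class labels when the model is trained.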
Optionally, converting the packet payload into machine-readable data comprises:
Converting the packet payload, given as a hexadecimal string, into the corresponding decimal numbers;
Dividing each converted decimal number by 255 to obtain L floating-point numbers in the range [0, 1], where L is the length of the packet payload.
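A minimal sketch of this conversion, assuming the payload length L is counted in bytes and each pair of hexadecimal digits encodes one byte. The function name `payload_to_floats` and the sample string are illustrative only.

```python
def payload_to_floats(hex_payload):
    """Convert a hexadecimal payload string into one float in [0, 1] per byte."""
    raw = bytes.fromhex(hex_payload)   # each pair of hex digits -> one value in 0-255
    return [b / 255 for b in raw]      # scale every byte into [0, 1]


# Hypothetical 5-byte sample payload.
vec = payload_to_floats("474554202f")
```

Scaling by 255 keeps every input in the same range regardless of the byte value, which is a common preparation step for neural-network inputs.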
Optionally, the application identification model is used as follows:
The input data of the application identification model is obtained and processed by convolutional layers and pooling layers to generate deep features of the input data;
The deep features are passed to a fully connected layer, the same as that of a conventional neural network, which parses them;
The fully connected layer passes the parsing result of the deep features to the output layer, which outputs it.
Optionally, the convolutional layers and pooling layers are stacked in multiple layers; the more layers are stacked, the deeper the features.
Optionally, the convolutional layers and the pooling layers are used in pairs.
Optionally, the window dimension of the convolutional layers and the pooling layers is 1*n.
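The layer sequence described above can be sketched as a forward pass in plain Python. This is a rough illustration under several assumptions that the text does not state: a single channel, ReLU activation after each convolution, softmax at the output layer, non-overlapping pooling, and random untrained weights. All function names and sizes are hypothetical.

```python
import math
import random


def conv1d(x, kernel, bias=0.0):
    """Valid 1-D convolution with a 1*n window, followed by ReLU (assumed)."""
    n = len(kernel)
    return [max(0.0, sum(x[i + j] * kernel[j] for j in range(n)) + bias)
            for i in range(len(x) - n + 1)]


def max_pool1d(x, n):
    """Non-overlapping 1-D max pooling with a 1*n window."""
    return [max(x[i:i + n]) for i in range(0, len(x) - n + 1, n)]


def dense(x, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(z):
    """Turn raw scores into probabilities that sum to 1 (assumed output layer)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, conv_kernels, pool_n, fc_w, fc_b):
    h = x
    for kernel in conv_kernels:            # convolution and pooling used in pairs, stacked
        h = max_pool1d(conv1d(h, kernel), pool_n)
    return softmax(dense(h, fc_w, fc_b))   # fully connected layer, then output layer


rng = random.Random(0)
x = [rng.random() for _ in range(32)]      # 32 payload bytes already scaled to [0, 1]
conv_kernels = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
num_classes = 5
# Feature length after two conv(3)/pool(2) pairs: 32 -> 30 -> 15 -> 13 -> 6.
fc_w = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(num_classes)]
fc_b = [0.0] * num_classes
probs = forward(x, conv_kernels, 2, fc_w, fc_b)
```

A 1*n window suits this input because a payload is a one-dimensional byte sequence rather than a two-dimensional image.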
According to another aspect of the present invention, an embodiment provides a method for identifying traffic data, comprising:
Receiving traffic data, where the traffic data carries the packet payload present when the network-side node received it;
Converting the traffic data into data that the application identification model can recognize;
Inputting the recognizable data into the application identification model to obtain the probabilities that the data belongs to the different host processes;
Identifying the host process that corresponds to the traffic data according to the obtained probabilities.
Optionally, converting the traffic data into data that the application identification model can recognize comprises:
Converting the packet payload of the traffic data into machine-readable data that the application identification model can recognize.
Optionally, converting the packet payload of the traffic data into data that the application identification model can recognize comprises:
Converting the packet payload, given as a hexadecimal string, into the corresponding decimal numbers;
Dividing each converted decimal number by 255 to obtain L floating-point numbers in the range [0, 1], where L is the length of the packet payload.
Optionally, identifying the host process that corresponds to the traffic data according to the obtained probabilities comprises:
Selecting the maximum probability value as the decision for the traffic data, thereby determining the name of the host process that corresponds to the traffic data.
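Selecting the maximum probability can be sketched as below. The function name `identify_process`, the process names, and the probability values are hypothetical; the list of names is assumed to be ordered by the natural-number labels used when the model was established.

```python
def identify_process(probabilities, process_names):
    """Return the process name with the highest probability, and that probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return process_names[best], probabilities[best]


# Hypothetical model output for one piece of traffic data.
names = ["S.exe", "A.exe", "B.exe", "C.exe", "D.exe"]
probs = [0.05, 0.10, 0.70, 0.10, 0.05]
name, p = identify_process(probs, names)
```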
According to a further aspect of the present invention, an embodiment provides a device for establishing a deep-learning-based application identification model. The device is applied to an environment in which a host and a network-side node transmit data, at least one host process with data-processing capability runs on the host, and the device comprises:
A first obtaining module, adapted to obtain multiple pieces of host data sent by the host, where each piece of host data carries the name of the host process that processes it on the host;
A second obtaining module, adapted to obtain multiple pieces of traffic data received by the network-side node, where each piece of traffic data carries the packet payload present when the network-side node received it;
A comparison module, adapted to compare each piece of host data with each piece of traffic data to find at least one pair of host data and traffic data that are associated;
A third obtaining module, adapted to process the parameters of each associated pair of host data and traffic data to obtain the correspondence between the host process name and the packet payload of each pair;
An establishing module, adapted to establish the application identification model from the correspondence between the host process name and the packet payload of each pair.
Optionally, the comparison module is further adapted to:
Compare the parameters carried by each piece of host data with the parameters carried by each piece of traffic data;
Find the associated pairs of host data and traffic data according to a comparison rule under which multiple parameters are identical, or the ratio of identical parameters exceeds a ratio threshold.
Optionally, the parameters carried by host data include at least: the sending time of the host data, the source IP address, the source port number, the destination IP address, the destination port number, and the name of the process that handles the host data;
The parameters carried by traffic data include at least: the receiving time of the traffic data, the source IP address, the source port number, the destination IP address, the destination port number, and the packet payload of the traffic data.
Optionally, the comparison module is further adapted to:
After the associated pairs of host data and traffic data have been found according to the comparison rule under which multiple parameters are identical, or the ratio of identical parameters exceeds the ratio threshold, screen the pairs determined to be associated according to a screening rule, to pick out the pairs whose association is a pseudo-association;
Delete the pairs of host data and traffic data that have a pseudo-association.
Optionally, the comparison module is further adapted to:
If one piece of host data is associated with two or more pieces of traffic data, determine that the association is a pseudo-association; or
If one piece of host data is associated with one piece of traffic data but the time difference between the two exceeds a time-difference threshold, determine that the association is a pseudo-association.
Optionally, the establishing module is further adapted to:
Convert the host process name and the packet payload separately into machine-readable data;
Establish the correspondence between the converted host process name and the converted packet payload, and establish the application identification model from that correspondence.
Optionally, the establishing module is further adapted to:
Map the host process names onto an ordered list that starts at 0 and increments by one, so that each host process name is converted into a corresponding natural number.
Optionally, the establishing module is further adapted to:
Convert the packet payload, given as a hexadecimal string, into the corresponding decimal numbers;
Divide each converted decimal number by 255 to obtain L floating-point numbers in the range [0, 1], where L is the length of the packet payload.
According to a further aspect of the present invention, an embodiment provides a device for identifying traffic data, comprising:
A receiving module, adapted to receive traffic data, where the traffic data carries the packet payload present when the network-side node received it;
A conversion module, adapted to convert the traffic data into data that the application identification model can recognize;
An input module, adapted to input the recognizable data into the application identification model to obtain the probabilities that the data belongs to the different host processes;
An identification module, adapted to identify the host process that corresponds to the traffic data according to the probabilities obtained by the input module.
Optionally, the conversion module is further adapted to:
Convert the packet payload of the traffic data into machine-readable data that the application identification model can recognize.
Optionally, the conversion module is further adapted to:
Convert the packet payload, given as a hexadecimal string, into the corresponding decimal numbers;
Divide each converted decimal number by 255 to obtain L floating-point numbers in the range [0, 1], where L is the length of the packet payload.
Optionally, the identification module is further adapted to:
Select the maximum probability value as the decision for the traffic data, thereby determining the process name that corresponds to the traffic data.
In the embodiments of the present invention, host data and traffic data are compared with each other to find the points at which they are associated, and the associated host data and traffic data are then selected on that basis. On the host side, host data is sent by a specific process; on the network side, the packet payload of traffic data can be determined when the traffic data is obtained. By analyzing the association between host data and traffic data, the embodiments determine the actual correspondence between host process names and packet payloads and generate the application identification model from that correspondence. When the model is later applied to traffic data, the name of the process that sent the corresponding host data can be found from the packet payload of the traffic data, and the application that sent the host data can thus be determined. Identifying the application makes it possible to determine its priority and hence the processing priority of the traffic data, which helps an enterprise optimize and configure its applications and network so that information is transmitted quickly and work proceeds efficiently. In other words, the embodiments can establish an application identification model that identifies the application which initiated a piece of traffic data: feeding traffic data into the model quickly reveals which application on the host sent it, with no manual involvement and no identification software added on the host side, which greatly improves the accuracy, convenience, and efficiency of application identification.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features, and advantages of the invention are easier to understand, specific embodiments of the invention are set out below.
From the following detailed description of specific embodiments, read in conjunction with the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages, and features of the present invention.
Brief Description of the Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the present invention. Throughout the drawings, the same reference numbers denote the same parts. In the drawings:
Fig. 1 shows a process flow chart of the method for establishing a deep-learning-based application identification model according to an embodiment of the invention;
Fig. 2 shows a flow diagram of the use of the application identification model according to an embodiment of the invention;
Fig. 3 shows a method for identifying traffic data according to an embodiment of the invention;
Fig. 4 shows a structural diagram of the device for establishing a deep-learning-based application identification model according to an embodiment of the invention;
Fig. 5 shows a structural diagram of the device for identifying traffic data according to an embodiment of the invention; and
Fig. 6 shows a system diagram of the process of establishing the application identification model and the subsequent process of identifying traffic data according to an embodiment of the invention.
Detailed Description of the Embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope conveyed fully to those skilled in the art.
To solve the above technical problems, the embodiments of the present invention propose a new application identification model that is generated based on deep learning. Deep learning is a branch of machine learning; in essence it builds a deep neural network that analyzes and learns automatically, imitating the mechanisms of the human brain to learn from data, and it has achieved notable results in the image, speech, and text fields in recent years. The application identification model established by the embodiments is likewise capable of self-learning: it analyzes data to determine deep-learning features and learns model parameters automatically, without relying on manual analysis or a file feature library. Moreover, neither establishing the model nor the identification process requires identification software on the host or data storage and computation on the host side, so the process is imperceptible to users and improves user experience.
Based on this inventive concept, an embodiment of the present invention proposes a method for establishing a deep-learning-based application identification model. The method is applied to an environment in which a host and a network-side node transmit data. Data obtained from the host side is referred to below as host data, and data obtained from the network-side node is referred to below as traffic data. At least one host process with data-processing capability runs on the host. Fig. 1 shows a process flow chart of the method according to an embodiment of the invention. Referring to Fig. 1, the method includes at least:
Step S102: obtain multiple pieces of host data sent by the host, where each piece of host data carries the name of the host process (process) that processes it on the host;
Step S104: obtain multiple pieces of traffic data received by the network-side node, where each piece of traffic data carries the packet payload (payload) present when the network-side node received it;
Step S106: compare each piece of host data with each piece of traffic data to find at least one pair of host data and traffic data that are associated;
Step S108: process the parameters of each associated pair of host data and traffic data to obtain the correspondence between the host process name and the packet payload of each pair;
Step S110: establish the application identification model from the correspondence between the host process name and the packet payload of each pair.
In the embodiments of the present invention, host data and traffic data are compared with each other to find the points at which they are associated, and the associated host data and traffic data are then selected on that basis. On the host side, host data is sent by a specific process; on the network side, the packet payload of traffic data can be determined when the traffic data is obtained. By analyzing the association between host data and traffic data, the embodiments determine the actual correspondence between host process names and packet payloads and generate the application identification model from that correspondence. When the model is later applied to traffic data, the name of the process that sent the corresponding host data can be found from the packet payload of the traffic data, and the application that sent the host data can thus be determined. Identifying the application makes it possible to determine its priority and hence the processing priority of the traffic data, which helps an enterprise optimize and configure its applications and network so that information is transmitted quickly and work proceeds efficiently. In other words, the embodiments can establish an application identification model that identifies the application which initiated a piece of traffic data: feeding traffic data into the model quickly reveals which application on the host sent it, with no manual involvement and no identification software added on the host side, which greatly improves the accuracy, convenience, and efficiency of application identification.
In a preferred embodiment, step S106 compares each piece of host data with each piece of traffic data to find at least one associated pair. Host data and traffic data both carry multiple parameters, and the parameters of a piece of data usually contain its core content or identifying content, so the parameters carried by each piece of host data can be compared directly with the parameters carried by each piece of traffic data. If a piece of host data and a piece of traffic data satisfy the comparison rule, that is, multiple parameters are identical (for example, three or more) or the ratio of identical parameters exceeds a ratio threshold (for example, 60% or more), the two are determined to be associated. The comparison proceeds in turn until all associated pairs of host data and traffic data have been found.
In a specific embodiment, the parameters carried by host data include at least: the sending time of the host data, the source IP address, the source port number, the destination IP address, the destination port number, and the name of the process that handles the host data. The parameters carried by traffic data include at least: the receiving time of the traffic data, the source IP address, the source port number, the destination IP address, the destination port number, and the packet payload of the traffic data. In this case, if the source IP address, source port number, and destination port number of a piece of host data are identical to those of a piece of traffic data, the two can be considered associated.
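The comparison rule can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the record types `HostRecord` and `FlowRecord`, the choice to compare only the four address fields (leaving timestamps to the later screening step), the thresholds taken from the examples in the text, and all sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HostRecord:
    time: float
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    process: str      # name of the process that handled the host data


@dataclass
class FlowRecord:
    time: float
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    payload: str      # packet payload as a hexadecimal string


# Fields compared between the two record types (an assumption for this sketch).
COMPARED = ("src_ip", "src_port", "dst_ip", "dst_port")


def is_associated(host, flow, min_equal=3, ratio_threshold=0.6):
    """Apply the rule: enough identical parameters, or a high enough ratio."""
    equal = sum(getattr(host, f) == getattr(flow, f) for f in COMPARED)
    return equal >= min_equal or equal / len(COMPARED) > ratio_threshold


def find_pairs(hosts, flows):
    """Compare every host record with every flow record and keep associated pairs."""
    return [(h, f) for h in hosts for f in flows if is_associated(h, f)]


hosts = [
    HostRecord(10.00, "192.0.2.10", 50001, "198.51.100.7", 80, "S.exe"),
    HostRecord(10.20, "192.0.2.10", 50002, "198.51.100.7", 443, "C.exe"),
]
flows = [
    FlowRecord(10.01, "192.0.2.10", 50001, "198.51.100.7", 80, "474554202f"),
    FlowRecord(10.21, "192.0.2.10", 50002, "198.51.100.7", 443, "160302009c"),
]
pairs = find_pairs(hosts, flows)
```

Each resulting pair links a process name on the host side to a payload on the network side, which is the correspondence the model is later built from.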
Further, the above comparison determines most associated data, but some special cases remain, which the embodiments of the present invention refer to as pairs of host data and traffic data with a pseudo-association. If such special cases exist, the pairs determined to be associated are screened according to a screening rule to pick out the pairs with a pseudo-association, and those pairs are deleted.
In the embodiments of the present invention, an association is determined to be a pseudo-association if it meets either of the following conditions:
First, if one piece of host data is associated with two or more pieces of traffic data, the association is determined to be a pseudo-association;
After one piece of host data is sent, the network-side node must also receive exactly one piece of data. If two or more associated pieces of traffic data appear, other data may have been mixed in during transmission, and the association cannot be determined uniquely. If such data were used to establish the application identification model, the same traffic data might be identified as two applications, which would harm the accuracy of the model's results.
Second, if one piece of host data is associated with one piece of traffic data but the time difference between the two exceeds a time-difference threshold, the association is determined to be a pseudo-association.
With today's high-speed network access, data transmission is usually fast and takes very little time. If the two times differ too much, the host data has probably been lost, and what the network-side node received is not the traffic data that corresponds to this host data. To guarantee the accuracy of the data used to establish the application identification model, such cases are not used.
Because the application identification model itself is a machine model, while host process names and data packet payloads are not in machine-readable form, in implementation the host process names and data packet payloads can each be converted in advance into machine-recognizable data. A correspondence is then established between the converted host process names and data packet payloads, and the application identification model is established from that correspondence.
Specifically, the host process names can be mapped to an ordered list that starts at 0 and increments by one, so that each host process name is converted to a corresponding natural number. In addition, a data packet payload given as a hexadecimal string can be converted into the corresponding decimal numbers; each decimal number is then divided by 255, yielding L floating-point numbers in [0,1], where L is the length of the data packet payload.
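The two conversions just described can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `payload_to_floats` and `build_label_table` are hypothetical, and it assumes that each pair of hexadecimal digits is one byte and that indices are assigned in order of first appearance.

```python
def payload_to_floats(hex_payload):
    """Convert a hexadecimal payload string into L floats in [0, 1].

    L is the number of bytes in the payload (two hex digits per byte).
    """
    raw = bytes.fromhex(hex_payload)   # each byte is an integer in 0..255
    return [b / 255 for b in raw]


def build_label_table(process_names):
    """Map each distinct process name to a natural number, starting at 0.

    Assumption: indices follow the order in which names first appear.
    """
    table = {}
    for name in process_names:
        if name not in table:
            table[name] = len(table)
    return table


# Usage with the visible (truncated) prefix of one payload and the
# process column of the correspondence list.
features = payload_to_floats("474554202f")
labels = build_label_table(
    ["S.exe", "S.exe", "A.exe", "B.exe", "C.exe", "B.exe", "D.exe"]
)
```

Dividing by 255 keeps every input value in the same [0,1] range regardless of payload content, which is a common way to normalize byte-valued inputs before they are fed to a neural network.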
The embodiment of the present invention provides a specific example of converting host process names and data packet payloads into machine-recognizable data. Table one shows a list of correspondences between host process names and data packet payloads obtained according to the embodiment of the present invention.
Table one
Payload Process
474554202f… S.exe
474554201e… S.exe
70a100347b… A.exe
803d010301… B.exe
160302009c… C.exe
803d010302… B.exe
504f535420… D.exe
First, the data packet payload is converted. For Payload: the hexadecimal string is converted into the corresponding decimal numbers in the range 0-255, and each number is then divided by 255. For each sample, the result is L floating-point numbers in [0,1], where L is the length of the payload. The conversion results are shown in table two.
Table two
Second, the host process name is converted. Specifically, the names are mapped to an ordered table that starts at 0 and increments by one, as shown in table three.
Table three
Process
0 S.exe
1 A.exe
2 B.exe
3 C.exe
4 D.exe
…… ……
For convenience of retrieval, each process in table three is further replaced by its numerical index, which produces table four:
Table four
After the training data has been converted, the embodiment of the present invention generates the application identification model from the converted data and uses it. Fig. 2 shows a flow diagram of the use of the application identification model according to an embodiment of the present invention. The application identification model provided by the embodiment of the present invention is based on a convolutional neural network (CNN), a deep learning model commonly used in image recognition, for example handwritten digit recognition, face recognition, and picture classification. The biggest difference between this model and a traditional CNN model is that the window dimension of convolution and pooling is 1*n rather than the n*n used for two-dimensional images (n is the size of the basic convolution or pooling unit).
Specifically, the process of using the application identification model includes:
First, the input data of the application identification model is obtained and processed by convolutional layers and pooling layers to generate the deep features of the input data. The input data first passes through several convolutional and pooling layers. Convolutional and pooling layers are usually used in pairs, or, beyond a certain depth, only convolutional layers are used without pooling layers. The more layers are stacked, the deeper the resulting network. The convolutional and pooling layers need to be applied at least twice to generate deep features.
Second, the deep features are sent to fully connected layers, identical to those of a traditional neural network, which parse the deep features. The number of fully connected layers should not be too large, generally 1 to 3.
Finally, the fully connected layers pass the parsing result of the deep features to the output layer, which outputs it.
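The forward pass described above (paired 1*n convolution and pooling, then a fully connected layer, then the output layer) can be sketched in plain Python. This is a deliberately simplified illustration under stated assumptions: a single channel with one filter per convolutional layer, ReLU as the activation, a pooling window of 2, a softmax output, and hand-picked weights. None of these details, nor the function names, come from the source; a real model would learn many filters per layer.

```python
import math


def conv1d(x, kernel, bias=0.0):
    """Valid 1*n convolution (cross-correlation) followed by ReLU."""
    n = len(kernel)
    return [
        max(0.0, sum(x[i + j] * kernel[j] for j in range(n)) + bias)
        for i in range(len(x) - n + 1)
    ]


def max_pool1d(x, n):
    """Non-overlapping 1*n max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + n]) for i in range(0, len(x) - n + 1, n)]


def dense(x, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(z):
    """Turn output-layer scores into probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, kernels, weights, biases, pool=2):
    """Apply one conv+pool pair per kernel, then dense and softmax."""
    for kernel in kernels:
        x = max_pool1d(conv1d(x, kernel), pool)
    return softmax(dense(x, weights, biases))


# Hypothetical input of L = 16 values in [0, 1] and two conv+pool stages.
x = [i / 15 for i in range(16)]
kernels = [[0.5, -0.25, 0.5], [1.0, 0.0, -1.0]]
probs = forward(x, kernels, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
```

Because the input is a one-dimensional sequence of payload bytes rather than a two-dimensional image, the windows slide along a single axis, which is what the 1*n window dimension expresses.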
Of course, the application identification model needs to be updated, and the update period depends on the specific situation. If high-performance GPU computing is used, the model can be updated by daily or weekly training, according to actual resources and business needs; if a CPU cluster is used, the model can be updated by weekly or monthly training, according to actual resources and business needs.
After the application identification model has been established, the embodiment of the present invention can use it to identify flow data. Fig. 3 shows a flow data identification method according to an embodiment of the present invention. Referring to Fig. 3, the method includes at least:
Step S302: receive flow data, where the flow data carries the data packet payload present when the network side node received the flow data;
Step S304: convert the flow data into data recognizable by the application identification model;
Step S306: input the recognizable data into the application identification model to obtain the probabilities that the identified data belongs to different host processes;
Step S308: identify the host process corresponding to the flow data according to the obtained probabilities.
Specifically, the maximum probability value can be taken as the decision result for the flow data, which determines the host process name corresponding to the flow data.
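Step S308 amounts to an argmax over the output probabilities. A minimal sketch follows, assuming the probabilities are ordered by the same indices used to number the process names; the function name `identify` and the probability values in the usage line are hypothetical.

```python
def identify(probabilities, process_names):
    """Return the process name with the highest probability, and that probability.

    Assumption: probabilities[i] is the model's output for process_names[i].
    """
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return process_names[best], probabilities[best]


# Hypothetical model output for one flow data record.
name, p = identify([0.05, 0.10, 0.15, 0.60, 0.10],
                   ["S.exe", "A.exe", "B.exe", "C.exe", "D.exe"])
```

If two processes tie for the maximum, this sketch returns the one with the lower index; the source does not say how ties should be handled.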
As mentioned above, when the application identification model is established, the flow data needs to be converted into machine-recognizable form so that it can be read and applied. Similarly, when the application identification model is used to identify flow data, step S304 must also be executed to convert the flow data into data recognizable by the application identification model. Specifically, the data packet payload of the flow data is converted into data recognizable by the application identification model. First, the data packet payload given as a hexadecimal string is converted into the corresponding decimal numbers. Second, each decimal number is divided by 255, yielding L floating-point numbers in [0,1], where L is the length of the data packet payload.
The embodiment of the present invention provides a specific example of flow data identification. The data to be identified is converted and then identified with the application identification model; the output layer outputs, for each record to be identified, the probability that it belongs to each application, as shown in table five:
Table five
Finally, the application name with the maximum probability value is taken as the decision result for the record; in the example, C.exe is taken as the recognition result.
To explain the establishment of the application identification model and the subsequent flow data identification process more clearly, the embodiment of the present invention provides a complete example, described below.
1, the acquisition of data
In the training stage, the data is divided into two parts: host data and flow data.
Host data is obtained from the host side and includes Time (time), SIP (source IP address), SPort (source port number), DIP (destination IP address), DPort (destination port number), and Process (the name of the process corresponding to the running application, such as "svchost.exe"), forming a six-tuple, as shown in table six:
Table six
Flow data is obtained from the network side and includes Time (time), SIP (source IP address), SPort (source port number), DIP (destination IP address), DPort (destination port number), and Payload (the payload, that is, the spliced uplink and downlink packets of the TCP network flow, such as "705ba387fe ..."), forming a six-tuple, as shown in table seven:
Table seven
In the identification stage, the input data consists only of flow data, in the same form as in the training stage.
2, the building (association of host and data on flows) of training data
Among the fields of the two tables above, those present in both kinds of data are: Time, SIP, SPort, DIP, DPort. Exact matching is first performed on SIP, SPort, DIP, and DPort. Because the data sources differ and the times at which the systems record or upload the data differ, Time is subject to some delay, so Time is then used for further association.
Taking the two schematic tables above as an example, the following cases require special handling when associating the data:
(1) By exact matching on the four-tuple, 1 host data record is preliminarily associated with 2 flow data records, but their times are close, so the time cannot determine which Payload corresponds to the application whose Process is A.exe. This case is therefore not associated and is not added to the training data.
(2) The four-tuples in the two kinds of data correspond one to one and the time interval is small, so the association is considered correct, and "474554202f ..." and "S.exe" are added to the training data.
(3) Although the four-tuples correspond one to one, the times differ by 31 minutes, which is too large an interval, so the records are considered unrelated and are not associated. Here, the threshold for the time interval depends on the actual data in each local area network and usually does not exceed 10 minutes.
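The association procedure and its three cases can be sketched as below. This is an illustrative sketch, not the patented implementation: the dictionary field names (`sip`, `sport`, `dip`, `dport`, `time`, `payload`, `process`) and the function name `associate` are assumptions, and the 10-minute default simply reuses the upper bound mentioned above.

```python
from collections import defaultdict
from datetime import datetime, timedelta

FOUR_TUPLE = ("sip", "sport", "dip", "dport")  # hypothetical field names


def associate(host_rows, flow_rows, max_gap=timedelta(minutes=10)):
    """Return (payload, process) training pairs that pass both screening rules."""
    # Index flow records by the four-tuple shared with host records.
    flows_by_key = defaultdict(list)
    for flow in flow_rows:
        flows_by_key[tuple(flow[k] for k in FOUR_TUPLE)].append(flow)

    pairs = []
    for host in host_rows:
        candidates = flows_by_key.get(tuple(host[k] for k in FOUR_TUPLE), [])
        # Case (1): one host record matching two or more flow records is
        # ambiguous, so it is dropped (no match at all is dropped as well).
        if len(candidates) != 1:
            continue
        flow = candidates[0]
        # Case (3): a time difference above the threshold is treated as
        # unrelated data and is not associated.
        if abs(flow["time"] - host["time"]) > max_gap:
            continue
        # Case (2): a unique match within the threshold becomes training data.
        pairs.append((flow["payload"], host["process"]))
    return pairs
```

Indexing the flow records by four-tuple first means each host record is matched with a dictionary lookup rather than a scan over all flow records, which matters when both inputs are large.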
The training data obtained after association is shown in table one:
Table one
3, data convert
For Payload: the hexadecimal string is converted into the corresponding decimal numbers in the range 0-255, and each number is then divided by 255. For each sample, the result is L floating-point numbers in [0,1], where L is the length of the payload.
Example is shown in Table two:
Table two
For data to be identified, this step alone is sufficient; for training data, the following processing is also required:
For the application name in Process: the names are mapped to an ordered table that starts at 0 and increments by one. An example is shown in table three:
Table three
Process
0 S.exe
1 A.exe
2 B.exe
3 C.exe
4 D.exe
…… ……
For convenience of retrieval, each process in table three is further replaced by its numerical index, which produces table four:
Table four
Finally, the training data is converted into a series of associated pairs Xi and Yi.
Example is shown in Table eight:
Table eight
5, identification process
Input: transformed data to be identified, transform method are identical with the transform method of training data.
Using the trained CNN model parameters, forward operations such as convolution, pooling, and activation are performed, and the output layer finally outputs the probability that each record to be identified belongs to each application, as in table five:
Table five
Finally, the application name with the maximum probability value is taken as the decision result for the record; in the example, C.exe is taken as the recognition result.
6, recognition result post-processes
After the application has been identified, it can be compared with a library of known applications to determine whether it is on a restricted list, so that corresponding measures can be taken. Statistics can also be collected on the applications used by each host in the local area network, to understand the distribution of applications in the local area network. For a given application, a corresponding processing strategy is adopted according to the probability value.
The comparison cases and the corresponding processing methods are shown in table nine:
Table nine
The threshold value for being identified as certain applied probability height artificially can rule of thumb be set, and such as larger than 0.3 thinks probability height, be known Other result is reliable;Think that probability is low less than 0.3, recognition result is unreliable etc..
Based on the same inventive concept, the embodiment of the present invention also provides a device for establishing an application identification model based on deep learning, applied to an environment in which a host and a network side node transmit data, where at least one host process with data processing capability runs on the host. Fig. 4 shows a structural diagram of the device for establishing an application identification model based on deep learning according to an embodiment of the present invention. Referring to Fig. 4, the device includes at least:
A first obtaining module 410, adapted to obtain multiple host data records sent by the host, where each host data record carries the name of the host process that handled it on the host;
A second obtaining module 420, adapted to obtain multiple flow data records received by the network side node, where each flow data record carries the data packet payload present when the network side node received it;
A comparison module 430, coupled to the first obtaining module 410 and the second obtaining module 420 respectively, adapted to compare each host data record with each flow data record, to find at least one associated pair of host data and flow data;
A third obtaining module 440, coupled to the comparison module 430, adapted to process the parameters of each associated pair of host data and flow data, to obtain the correspondence between the host process name and the data packet payload for each associated pair;
An establishing module 450, coupled to the third obtaining module 440, adapted to establish the application identification model from the correspondence between each pair's host process name and data packet payload.
In a preferred embodiment, comparison module 430 is further adapted for:
Each host data is compared with each parameter that each data on flows carries;
It is identical according to multiple groups parameter, alternatively, identical parameters ratio is more than the comparison rules of proportion threshold value, have to find out The host data and data on flows pair of relevance.
In a preferred embodiment,
The parameter that host data carries includes at least: transmission time, source IP address, source port number, the target of host data IP address, destination port number, the process title for handling host data;
The parameter that data on flows carries includes at least: receiving time, source IP address, source port number, the target of data on flows IP address, destination port number, the data pack load of data on flows.
In a preferred embodiment, comparison module 430 is further adapted for:
It is identical according to multiple groups parameter, alternatively, identical parameters ratio is more than the comparison rules of proportion threshold value, have to find out The host data of relevance and data on flows are to later, according to screening rule to the determining host data and stream for having relevance Data are measured to screening, further screening wherein has the host data and data on flows pair of spurious correlation;
Delete the host data and data on flows pair for having spurious correlation.
In a preferred embodiment, comparison module 430 is further adapted for:
If a host data and two or more data on flows have relevance, it is determined that this relevance is spurious correlation; Or
If a host data and a data on flows have relevance, but the two time difference is more than time difference threshold value, Then determine that this relevance is spurious correlation.
In a preferred embodiment, module 450 is established to be further adapted for:
Machine language conversion is carried out to host processes title and data pack load respectively, is converted into machine recognizable machine Data;
Corresponding relationship is further established between host processes title and data pack load in post-conversion, and is closed using the correspondence System, which establishes, applies identification model.
In a preferred embodiment, module 450 is established to be further adapted for:
Map the host process names to an ordered list that starts at 0 and increments by one, converting each host process name to a corresponding natural number.
In a preferred embodiment, module 450 is established to be further adapted for:
Corresponding decimal number is converted by the data pack load of hexadecimal string;
Divide each converted decimal number by 255 to obtain L floating-point numbers in [0,1], where L is the length of the data packet payload.
Based on the same inventive concept, the embodiment of the invention also provides a kind of identification devices of data on flows.Fig. 5 is shown The structural schematic diagram of the identification device of data on flows according to an embodiment of the invention.Referring to Fig. 5, which is included at least:
Receiving module 510 is suitable for receiving data on flows, wherein carry network side node in data on flows and receive the stream Measure data pack load when data;
Conversion module 520 is coupled with receiving module 510, suitable for data on flows is converted to knowing using identification model Other data;
Input module 530 is coupled with conversion module 520, is suitable for recognizable data input applying identification model, is obtained institute Identification data belong to the probability of different host processes;
Identification module 540 is coupled with input module 530, and the probability suitable for being obtained according to input module 530 identifies flow number According to corresponding host processes.
In a preferred embodiment, conversion module 520 is further adapted for:
Machine language conversion is carried out to the data pack load of data on flows, is converted into using the identifiable number of identification model According to.
In a preferred embodiment, conversion module 520 is further adapted for:
Corresponding decimal number is converted by the data pack load of hexadecimal string;
Divide each converted decimal number by 255 to obtain L floating-point numbers in [0,1], where L is the length of the data packet payload.
In a preferred embodiment, identification module 540 is further adapted for:
Maximum probability value is chosen as the judgement of data on flows as a result, determining the corresponding process title of data on flows.
In summary, the embodiment of the present invention provides a system diagram of the establishment of the application identification model and the subsequent flow data identification process, shown in Fig. 6. Host data and flow data enter the training data association module, then the training data conversion module and deep learning model training, which together establish the application identification module. When new flow data is input, it passes through the identification data conversion module and is then input into the application identification module to obtain a recognition result. The recognition result is then compared with the application library and handled accordingly.
The establishment of the application identification model and the subsequent flow data identification process provided by the embodiment of the present invention can achieve the following beneficial effects:
In the embodiment of the present invention, host data is compared with flow data to find the points of association between them, and the associated host data and flow data are then selected. On the host side, host data is sent by a specific process; on the network side, the data packet payload of flow data can be determined when the flow data is obtained. Through the association between host data and flow data, the embodiment of the present invention determines the actual correspondence between host process names and data packet payloads, and generates the application identification model from that correspondence. When the application identification model is later used, the name of the process that sent the host data can be found from the data packet payload of the flow data, which determines the application that sent the host data. Identifying the application makes it possible to determine the application's priority and hence the processing priority of the flow data, which helps to optimize and configure the applications and the network within an enterprise and to keep information transfer fast and work efficient. That is, the embodiment of the present invention can establish an application identification model that identifies the application that initiated flow data: once flow data is input into the model, the application on the host that sent it can be obtained quickly, without manual involvement and without installing additional identification software on the host side, which greatly improves the accuracy, convenience, and efficiency of application identification.
In the instructions provided here, numerous specific details are set forth.It is to be appreciated, however, that implementation of the invention Example can be practiced without these specific details.In some instances, well known method, structure is not been shown in detail And technology, so as not to obscure the understanding of this specification.
Similarly, it should be understood that in order to simplify the disclosure and help to understand one or more of the various inventive aspects, Above in the description of exemplary embodiment of the present invention, each feature of the invention is grouped together into single implementation sometimes In example, figure or descriptions thereof.However, the disclosed method should not be interpreted as reflecting the following intention: i.e. required to protect Shield the present invention claims features more more than feature expressly recited in each claim.More precisely, as following Claims reflect as, inventive aspect is all features less than single embodiment disclosed above.Therefore, Thus the claims for following specific embodiment are expressly incorporated in the specific embodiment, wherein each claim itself All as a separate embodiment of the present invention.
Those skilled in the art will understand that can be carried out adaptively to the module in the equipment in embodiment Change and they are arranged in one or more devices different from this embodiment.It can be the module or list in embodiment Member or component are combined into a module or unit or component, and furthermore they can be divided into multiple submodule or subelement or Sub-component.Other than such feature and/or at least some of process or unit exclude each other, it can use any Combination is to all features disclosed in this specification (including adjoint claim, abstract and attached drawing) and so disclosed All process or units of what method or apparatus are combined.Unless expressly stated otherwise, this specification is (including adjoint power Benefit require, abstract and attached drawing) disclosed in each feature can carry out generation with an alternative feature that provides the same, equivalent, or similar purpose It replaces.
In addition, it will be appreciated by those of skill in the art that although some embodiments described herein include other embodiments In included certain features rather than other feature, but the combination of the feature of different embodiments mean it is of the invention Within the scope of and form different embodiments.For example, in detail in the claims, embodiment claimed it is one of any Can in any combination mode come using.
Various component embodiments of the invention can be implemented in hardware, or to run on one or more processors Software module realize, or be implemented in a combination thereof.It will be understood by those of skill in the art that can be used in practice Microprocessor or digital signal processor (DSP) realize the application identification according to an embodiment of the present invention based on deep learning The some or all function of model foundation device and some or all components in a kind of identification device of data on flows Energy.The present invention is also implemented as some or all equipment or device journey for executing method as described herein Sequence (for example, computer program and computer program product).Such realization program of the invention can store can in computer It reads on medium, or may be in the form of one or more signals.Such signal can be downloaded from an internet website It obtains, is perhaps provided on the carrier signal or is provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and ability Field technique personnel can be designed alternative embodiment without departing from the scope of the appended claims.In the claims, Any reference symbol between parentheses should not be configured to limitations on claims.Word "comprising" does not exclude the presence of not Element or step listed in the claims.Word "a" or "an" located in front of the element does not exclude the presence of multiple such Element.The present invention can be by means of including the hardware of several different elements and being come by means of properly programmed computer real It is existing.In the unit claims listing several devices, several in these devices can be through the same hardware branch To embody.The use of word first, second, and third does not indicate any sequence.These words can be explained and be run after fame Claim.
So far, although those skilled in the art will appreciate that present invention has been shown and described in detail herein multiple shows Example property embodiment still without departing from the spirit and scope of the present invention, still can according to the present disclosure directly Determine or deduce out many other variations or modifications consistent with the principles of the invention.Therefore, the scope of the present invention is understood that and recognizes It is set to and covers all such other variations or modifications.
Based on one aspect of the present invention, it is disclosed that A1, a kind of application identification model foundation side based on deep learning Method is provided at least one on the host and has data applied to the environment that host and network side node carry out data transmission The host processes of processing capacity, comprising:
Obtain a plurality of host data of the host transmission, wherein carried in the host in each host data to this The host processes title that host data is handled;
Obtain the received a plurality of data on flows of the network side node, wherein the network is carried in each data on flows Side gusset receives the data pack load when data on flows;
Each host data is compared with each data on flows, to find out at least a pair of of the host for wherein having relevance Data and data on flows;
Each parameter to the host data and data on flows that have relevance is handled, with obtain it is each to have association Corresponding relationship corresponding to the host data and data on flows of property between host processes title and data pack load;
It is established using the corresponding relationship of each pair of host processes title and data pack load described using identification model.
A2, method according to a1, wherein each host data is compared with each data on flows, to find out it In have at least a pair of of the host data and data on flows of relevance, comprising:
Each host data is compared with each parameter that each data on flows carries;
It is identical according to multiple groups parameter, alternatively, identical parameters ratio is more than the comparison rules of proportion threshold value, have to find out The host data and data on flows pair of relevance.
A3, the method according to A2, wherein
The parameter that host data carries includes at least: transmission time, source IP address, source port number, the target of host data IP address, destination port number, the process title for handling host data;
The parameter that data on flows carries includes at least: receiving time, source IP address, source port number, the target of data on flows IP address, destination port number, the data pack load of data on flows.
A4, the method according to A2, wherein it is identical according to multiple groups parameter, alternatively, identical parameters ratio is more than ratio threshold The comparison rules of value, to find out the host data for having relevance and data on flows to later, further includes:
The determining host data for having relevance and data on flows are further sieved to screening according to screening rule Select the host data and data on flows pair for wherein having spurious correlation;
Have the host data and data on flows pair of spurious correlation described in deletion.
A5, method according to a4, wherein according to screening rule to the determining host data and stream for having relevance Data are measured to screening, further screening wherein has the host data and data on flows pair of spurious correlation, including following At least one:
If a host data and two or more data on flows have relevance, it is determined that this relevance is spurious correlation;
If a host data and a data on flows have relevance, but the two time difference is more than time difference threshold value, Then determine that this relevance is spurious correlation.
A6, according to the described in any item methods of A1 to A5, wherein utilize each pair of host processes title and data pack load Corresponding relationship is established described using identification model, comprising:
Machine language conversion is carried out to host processes title and data pack load respectively, is converted into machine recognizable machine Data;
Corresponding relationship is further established between host processes title and data pack load in post-conversion, and is closed using the correspondence System establishes described using identification model.
A7, the method according to A6, wherein machine language conversion is carried out to host processes title, being converted into machine can The machine data of identification, comprising:
Map the host process names to an ordered list that starts at 0 and increments by one, converting each host process name to a corresponding natural number.
A8, the method according to A6 or A7, wherein machine language conversion is carried out to data pack load, is converted into machine Identifiable machine data, comprising:
Corresponding decimal number is converted by the data pack load of hexadecimal string;
Divide each converted decimal number by 255 to obtain L floating-point numbers in [0,1], where L is the length of the data packet payload.
A9. The method according to any one of A1 to A8, wherein the application identification model is used as follows:
Obtaining the input data of the application identification model and processing it through a convolutional layer and a pooling layer to generate deep features of the input data;
Sending the deep features to a fully connected layer, identical to that of a neural network, which parses the deep features;
The fully connected layer passing the parsing result of the deep features to an output layer, which outputs the result.
A10. The method according to A9, wherein the convolutional layers and pooling layers are stacked in multiple layers, and the more layers are stacked, the deeper the deep features.
A11. The method according to A9 or A10, wherein the convolutional layer and the pooling layer are used in pairs.
A12. The method according to any one of A9 to A11, wherein the window dimension of the convolutional layer and the pooling layer is 1*n.
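The data flow described in A9 to A12 (paired 1*n convolution and pooling stages, then a fully connected layer, then an output layer) can be illustrated with a small forward pass in plain Python. This is a sketch under several assumptions that the text does not specify: one filter per convolutional layer, max pooling, a softmax output layer, no activation functions, and hand-picked rather than trained weights. All function names are hypothetical.

```python
import math

def conv1d(signal, kernel):
    """Slide a 1*n kernel over the signal (valid positions only)."""
    n = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(n))
            for i in range(len(signal) - n + 1)]

def max_pool1d(signal, n):
    """Keep the maximum of each non-overlapping 1*n window."""
    return [max(signal[i:i + n]) for i in range(0, len(signal) - n + 1, n)]

def fully_connected(features, weights, biases):
    """One output score per row of weights."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    """Turn scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(payload, kernels, fc_weights, fc_biases, pool=2):
    """Stacked convolution/pooling pairs, then fully connected and output."""
    features = payload
    for kernel in kernels:  # one convolution + pooling pair per kernel
        features = max_pool1d(conv1d(features, kernel), pool)
    return softmax(fully_connected(features, fc_weights, fc_biases))
```

Each additional kernel in `kernels` adds one convolution and pooling pair, which corresponds to the stacking described in A10 and A11; the output list holds one probability per host process.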
According to another aspect of the present invention, there is also disclosed B13, a traffic data identification method, comprising:
Receiving traffic data, wherein the traffic data carries the packet payload at the time the network-side node received the traffic data;
Converting the traffic data into data recognizable by the application identification model;
Inputting the recognizable data into the application identification model to obtain the probabilities that the data being identified belongs to different host processes;
Identifying the host process corresponding to the traffic data according to the obtained probabilities.
B14. The method according to B13, wherein converting the traffic data into data recognizable by the application identification model comprises:
Performing machine-language conversion on the packet payload of the traffic data, converting it into data recognizable by the application identification model.
B15. The method according to B14, wherein performing machine-language conversion on the packet payload of the traffic data, converting it into data recognizable by the application identification model, comprises:
Converting the packet payload, given as a hexadecimal string, into the corresponding decimal numbers;
Dividing the converted decimal numbers by 255 to obtain L floating-point numbers in [0,1], where L is the length of the packet payload.
B16. The method according to any one of B13 to B15, wherein identifying the host process corresponding to the traffic data according to the obtained probabilities comprises:
Selecting the maximum probability value as the determination result for the traffic data, and thereby determining the host process name corresponding to the traffic data.
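The decision step in B16 is an argmax over the model's output probabilities. A minimal sketch, assuming the probabilities are ordered by the natural-number labels assigned to process names during model building; the function name `identify_process` and the process names in the usage note are hypothetical.

```python
def identify_process(probabilities, process_names):
    """Return the process name with the highest probability, and that probability.

    probabilities[i] is assumed to be the probability of process_names[i].
    """
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return process_names[best], probabilities[best]
```

For example, `identify_process([0.1, 0.7, 0.2], ["a.exe", "b.exe", "c.exe"])` selects `b.exe`.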
According to another aspect of the present invention, there is also disclosed C17, a deep-learning-based apparatus for establishing an application identification model, applied to an environment in which a host and a network-side node transmit data, the host being provided with at least one host process having data-processing capability, the apparatus comprising:
A first acquisition module, adapted to acquire multiple pieces of host data sent by the host, wherein each piece of host data carries the name of the host process that processes that host data in the host;
A second acquisition module, adapted to acquire multiple pieces of traffic data received by the network-side node, wherein each piece of traffic data carries the packet payload at the time the network-side node received that traffic data;
A comparison module, adapted to compare each piece of host data with each piece of traffic data to find at least one pair of host data and traffic data that are associated;
A third acquisition module, adapted to process the parameters of each associated pair of host data and traffic data to obtain, for each associated pair, the correspondence between the host process name and the packet payload;
An establishing module, adapted to establish the application identification model using the correspondence between each pair of host process name and packet payload.
C18. The apparatus according to C17, wherein the comparison module is further adapted to:
Compare the parameters carried by each piece of host data with the parameters carried by each piece of traffic data;
Find the associated host data and traffic data pairs according to a comparison rule under which multiple groups of parameters are identical, or the proportion of identical parameters exceeds a proportion threshold.
C19. The apparatus according to C18, wherein
The parameters carried by host data include at least: the sending time of the host data, the source IP address, the source port number, the destination IP address, the destination port number, and the name of the process that handles the host data;
The parameters carried by traffic data include at least: the receiving time of the traffic data, the source IP address, the source port number, the destination IP address, the destination port number, and the packet payload of the traffic data.
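The comparison rule of C18, applied to the parameters listed in C19, can be sketched as follows. The dictionary keys (`src_ip`, `src_port`, `dst_ip`, `dst_port`, `process`, `payload`), the function names, and the choice of which four parameters to compare are all assumptions made for illustration; timestamps are left out here because they are handled by the later time-difference screening.

```python
# Parameters shared by host data and traffic data (hypothetical key names).
MATCH_FIELDS = ("src_ip", "src_port", "dst_ip", "dst_port")

def is_associated(host_rec, flow_rec, ratio_threshold=None):
    """Decide whether a host record and a traffic record are associated.

    With ratio_threshold=None every compared parameter must be identical;
    otherwise the share of identical parameters must exceed the threshold.
    """
    same = sum(host_rec[f] == flow_rec[f] for f in MATCH_FIELDS)
    if ratio_threshold is None:
        return same == len(MATCH_FIELDS)
    return same / len(MATCH_FIELDS) > ratio_threshold

def find_pairs(host_records, flow_records, ratio_threshold=None):
    """Return (process name, payload) for every associated pair."""
    return [(h["process"], f["payload"])
            for h in host_records
            for f in flow_records
            if is_associated(h, f, ratio_threshold)]
```

Each returned tuple is one process-name-to-payload correspondence of the kind the third acquisition module is described as producing.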
C20. The apparatus according to C18, wherein the comparison module is further adapted to:
After the associated host data and traffic data pairs have been found according to the comparison rule under which multiple groups of parameters are identical, or the proportion of identical parameters exceeds a proportion threshold, screen the pairs determined to be associated according to a screening rule, so as to pick out the host data and traffic data pairs that have a false association;
Delete the host data and traffic data pairs that have a false association.
C21. The apparatus according to C20, wherein the comparison module is further adapted to:
If one piece of host data is associated with two or more pieces of traffic data, determine that the association is a false association; or
If one piece of host data is associated with one piece of traffic data, but the time difference between the two exceeds a time-difference threshold, determine that the association is a false association.
C22. The apparatus according to any one of C17 to C21, wherein the establishing module is further adapted to:
Perform machine-language conversion on the host process name and on the packet payload respectively, converting each into machine-recognizable machine data;
Establish the correspondence between the converted host process name and the converted packet payload, and establish the application identification model using that correspondence.
C23. The apparatus according to C22, wherein the establishing module is further adapted to:
Map the host process names onto an ordered list that starts at 0 and increments by one, so that each host process name is converted into a corresponding natural number.
C24. The apparatus according to C22 or C23, wherein the establishing module is further adapted to:
Convert the packet payload, given as a hexadecimal string, into the corresponding decimal numbers;
Divide the converted decimal numbers by 255 to obtain L floating-point numbers in [0,1], where L is the length of the packet payload.
According to another aspect of the present invention, there is also disclosed D25, a traffic data identification apparatus, comprising:
A receiving module, adapted to receive traffic data, wherein the traffic data carries the packet payload at the time the network-side node received the traffic data;
A conversion module, adapted to convert the traffic data into data recognizable by the application identification model;
An input module, adapted to input the recognizable data into the application identification model to obtain the probabilities that the data being identified belongs to different host processes;
An identification module, adapted to identify the host process corresponding to the traffic data according to the probabilities obtained by the input module.
D26. The apparatus according to D25, wherein the conversion module is further adapted to:
Perform machine-language conversion on the packet payload of the traffic data, converting it into data recognizable by the application identification model.
D27. The apparatus according to D26, wherein the conversion module is further adapted to:
Convert the packet payload, given as a hexadecimal string, into the corresponding decimal numbers;
Divide the converted decimal numbers by 255 to obtain L floating-point numbers in [0,1], where L is the length of the packet payload.
D28. The apparatus according to any one of D25 to D27, wherein the identification module is further adapted to:
Select the maximum probability value as the determination result for the traffic data, and thereby determine the process name corresponding to the traffic data.

Claims (27)

1. A deep-learning-based method for establishing an application identification model, applied to an environment in which a host and a network-side node transmit data, the host being provided with at least one host process having data-processing capability, the method comprising:
Acquiring multiple pieces of host data sent by the host, wherein each piece of host data carries the name of the host process that processes that host data in the host;
Acquiring multiple pieces of traffic data received by the network-side node, wherein each piece of traffic data carries the packet payload at the time the network-side node received that traffic data;
Comparing each piece of host data with each piece of traffic data to find at least one pair of host data and traffic data that are associated;
Processing the parameters of each associated pair of host data and traffic data to obtain, for each associated pair, the correspondence between the host process name and the packet payload;
Establishing the application identification model using the correspondence between each pair of host process name and packet payload;
Wherein the application identification model is used as follows:
Obtaining the input data of the application identification model and processing it through a convolutional layer and a pooling layer to generate deep features of the input data;
Sending the deep features to a fully connected layer, identical to that of a neural network, which parses the deep features;
The fully connected layer passing the parsing result of the deep features to an output layer, which outputs the result.
2. The method according to claim 1, wherein comparing each piece of host data with each piece of traffic data to find at least one pair of host data and traffic data that are associated comprises:
Comparing the parameters carried by each piece of host data with the parameters carried by each piece of traffic data;
Finding the associated host data and traffic data pairs according to a comparison rule under which multiple groups of parameters are identical, or the proportion of identical parameters exceeds a proportion threshold.
3. The method according to claim 2, wherein
The parameters carried by host data include at least: the sending time of the host data, the source IP address, the source port number, the destination IP address, the destination port number, and the name of the process that handles the host data;
The parameters carried by traffic data include at least: the receiving time of the traffic data, the source IP address, the source port number, the destination IP address, the destination port number, and the packet payload of the traffic data.
4. The method according to claim 2, wherein, after finding the associated host data and traffic data pairs according to the comparison rule under which multiple groups of parameters are identical, or the proportion of identical parameters exceeds a proportion threshold, the method further comprises:
Screening the host data and traffic data pairs determined to be associated according to a screening rule, so as to pick out the pairs that have a false association;
Deleting the host data and traffic data pairs that have a false association.
5. The method according to claim 4, wherein screening the host data and traffic data pairs determined to be associated according to the screening rule, so as to pick out the pairs that have a false association, includes at least one of the following:
If one piece of host data is associated with two or more pieces of traffic data, the association is determined to be a false association;
If one piece of host data is associated with one piece of traffic data, but the time difference between the two exceeds a time-difference threshold, the association is determined to be a false association.
6. The method according to any one of claims 1 to 5, wherein establishing the application identification model using the correspondence between each pair of host process name and packet payload comprises:
Performing machine-language conversion on the host process name and on the packet payload respectively, converting each into machine-recognizable machine data;
Establishing the correspondence between the converted host process name and the converted packet payload, and establishing the application identification model using that correspondence.
7. The method according to claim 6, wherein performing machine-language conversion on the host process name, converting it into machine-recognizable machine data, comprises:
Mapping the host process names onto an ordered list that starts at 0 and increments by one, so that each host process name is converted into a corresponding natural number.
8. The method according to claim 6, wherein performing machine-language conversion on the packet payload, converting it into machine-recognizable machine data, comprises:
Converting the packet payload, given as a hexadecimal string, into the corresponding decimal numbers;
Dividing the converted decimal numbers by 255 to obtain L floating-point numbers in [0,1], where L is the length of the packet payload.
9. The method according to claim 1, wherein the convolutional layers and pooling layers are stacked in multiple layers, and the more layers are stacked, the deeper the deep features.
10. The method according to any one of claims 1 to 5, wherein the convolutional layer and the pooling layer are used in pairs.
11. The method according to any one of claims 1 to 5, wherein the window dimension of the convolutional layer and the pooling layer is 1*n, where n is the size of the basic convolution or pooling unit.
12. A traffic data identification method using an application identification model established by the deep-learning-based method for establishing an application identification model according to any one of claims 1 to 11, the identification method comprising:
Receiving traffic data, wherein the traffic data carries the packet payload at the time the network-side node received the traffic data;
Converting the traffic data into data recognizable by the application identification model;
Inputting the recognizable data into the application identification model to obtain the probabilities that the data being identified belongs to different host processes;
Identifying the host process corresponding to the traffic data according to the obtained probabilities.
13. The method according to claim 12, wherein converting the traffic data into data recognizable by the application identification model comprises:
Performing machine-language conversion on the packet payload of the traffic data, converting it into data recognizable by the application identification model.
14. The method according to claim 13, wherein performing machine-language conversion on the packet payload of the traffic data, converting it into data recognizable by the application identification model, comprises:
Converting the packet payload, given as a hexadecimal string, into the corresponding decimal numbers;
Dividing the converted decimal numbers by 255 to obtain L floating-point numbers in [0,1], where L is the length of the packet payload.
15. The method according to any one of claims 12 to 14, wherein identifying the host process corresponding to the traffic data according to the obtained probabilities comprises:
Selecting the maximum probability value as the determination result for the traffic data, and thereby determining the host process name corresponding to the traffic data.
16. A deep-learning-based apparatus for establishing an application identification model, applied to an environment in which a host and a network-side node transmit data, the host being provided with at least one host process having data-processing capability, the apparatus comprising:
A first acquisition module, adapted to acquire multiple pieces of host data sent by the host, wherein each piece of host data carries the name of the host process that processes that host data in the host;
A second acquisition module, adapted to acquire multiple pieces of traffic data received by the network-side node, wherein each piece of traffic data carries the packet payload at the time the network-side node received that traffic data;
A comparison module, adapted to compare each piece of host data with each piece of traffic data to find at least one pair of host data and traffic data that are associated;
A third acquisition module, adapted to process the parameters of each associated pair of host data and traffic data to obtain, for each associated pair, the correspondence between the host process name and the packet payload;
An establishing module, adapted to establish the application identification model using the correspondence between each pair of host process name and packet payload; wherein the application identification model is used as follows: obtaining the input data of the application identification model and processing it through a convolutional layer and a pooling layer to generate deep features of the input data; sending the deep features to a fully connected layer, identical to that of a neural network, which parses the deep features; the fully connected layer passing the parsing result of the deep features to an output layer, which outputs the result.
17. The apparatus according to claim 16, wherein the comparison module is further adapted to:
Compare the parameters carried by each piece of host data with the parameters carried by each piece of traffic data;
Find the associated host data and traffic data pairs according to a comparison rule under which multiple groups of parameters are identical, or the proportion of identical parameters exceeds a proportion threshold.
18. The apparatus according to claim 17, wherein
The parameters carried by host data include at least: the sending time of the host data, the source IP address, the source port number, the destination IP address, the destination port number, and the name of the process that handles the host data;
The parameters carried by traffic data include at least: the receiving time of the traffic data, the source IP address, the source port number, the destination IP address, the destination port number, and the packet payload of the traffic data.
19. The apparatus according to claim 18, wherein the comparison module is further adapted to:
After the associated host data and traffic data pairs have been found according to the comparison rule under which multiple groups of parameters are identical, or the proportion of identical parameters exceeds a proportion threshold, screen the pairs determined to be associated according to a screening rule, so as to pick out the host data and traffic data pairs that have a false association;
Delete the host data and traffic data pairs that have a false association.
20. The apparatus according to claim 19, wherein the comparison module is further adapted to:
If one piece of host data is associated with two or more pieces of traffic data, determine that the association is a false association; or
If one piece of host data is associated with one piece of traffic data, but the time difference between the two exceeds a time-difference threshold, determine that the association is a false association.
21. The apparatus according to any one of claims 16 to 20, wherein the establishing module is further adapted to:
Perform machine-language conversion on the host process name and on the packet payload respectively, converting each into machine-recognizable machine data;
Establish the correspondence between the converted host process name and the converted packet payload, and establish the application identification model using that correspondence.
22. The apparatus according to claim 21, wherein the establishing module is further adapted to:
Map the host process names onto an ordered list that starts at 0 and increments by one, so that each host process name is converted into a corresponding natural number.
23. The apparatus according to claim 21, wherein the establishing module is further adapted to:
Convert the packet payload, given as a hexadecimal string, into the corresponding decimal numbers;
Divide the converted decimal numbers by 255 to obtain L floating-point numbers in [0,1], where L is the length of the packet payload.
24. A traffic data identification apparatus using an application identification model established by the deep-learning-based apparatus for establishing an application identification model according to any one of claims 16 to 23, the identification apparatus comprising:
A receiving module, adapted to receive traffic data, wherein the traffic data carries the packet payload at the time the network-side node received the traffic data;
A conversion module, adapted to convert the traffic data into data recognizable by the application identification model;
An input module, adapted to input the recognizable data into the application identification model to obtain the probabilities that the data being identified belongs to different host processes;
An identification module, adapted to identify the host process corresponding to the traffic data according to the probabilities obtained by the input module.
25. The apparatus according to claim 24, wherein the conversion module is further adapted to:
Perform machine-language conversion on the packet payload of the traffic data, converting it into data recognizable by the application identification model.
26. The apparatus according to claim 25, wherein the conversion module is further adapted to:
Convert the packet payload, given as a hexadecimal string, into the corresponding decimal numbers;
Divide the converted decimal numbers by 255 to obtain L floating-point numbers in [0,1], where L is the length of the packet payload.
27. The apparatus according to any one of claims 24 to 26, wherein the identification module is further adapted to:
Select the maximum probability value as the determination result for the traffic data, and thereby determine the process name corresponding to the traffic data.
CN201610018242.2A 2016-01-12 2016-01-12 Using identification model method for building up, the recognition methods of data on flows and device Active CN105516027B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610018242.2A CN105516027B (en) 2016-01-12 2016-01-12 Using identification model method for building up, the recognition methods of data on flows and device

Publications (2)

Publication Number Publication Date
CN105516027A CN105516027A (en) 2016-04-20
CN105516027B true CN105516027B (en) 2019-03-12

Family

ID=55723677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610018242.2A Active CN105516027B (en) 2016-01-12 2016-01-12 Using identification model method for building up, the recognition methods of data on flows and device

Country Status (1)

Country Link
CN (1) CN105516027B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105812188A (en) * 2016-04-25 2016-07-27 北京网康科技有限公司 Traffic recognition method and device
CN106130839B (en) * 2016-07-12 2019-03-01 电子科技大学 A kind of business recognition method applied to broadband access network
CN106790019B (en) * 2016-12-14 2019-10-11 北京天融信网络安全技术有限公司 Encryption method for recognizing flux and device based on feature self study
CN108924090B (en) * 2018-06-04 2020-12-11 上海交通大学 Method for detecting traffics of shadowsocks based on convolutional neural network
CN110784330B (en) * 2018-07-30 2022-04-05 华为技术有限公司 Method and device for generating application recognition model
CN109361617B (en) * 2018-09-26 2022-09-27 中国科学院计算机网络信息中心 Convolutional neural network traffic classification method and system based on network packet load
CN109802868B (en) * 2019-01-10 2022-05-06 中山大学 Mobile application real-time identification method based on cloud computing
CN113326946A (en) * 2020-02-29 2021-08-31 华为技术有限公司 Method, device and storage medium for updating application recognition model
CN114499941B (en) * 2021-12-22 2023-08-04 天翼云科技有限公司 Training and detecting method of flow detection model and electronic equipment
CN116204386B (en) * 2023-04-26 2023-07-28 北京明易达科技股份有限公司 Method, system, medium and equipment for automatically identifying and monitoring application service relationship

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101741908A (en) * 2009-12-25 2010-06-16 青岛朗讯科技通讯设备有限公司 Identification method for application layer protocol characteristic
CN101764748A (en) * 2009-12-16 2010-06-30 福建星网锐捷网络有限公司 Method for identifying application program, device and system thereof
CN105100091A (en) * 2015-07-13 2015-11-25 北京奇虎科技有限公司 Protocol identification method and protocol identification system

Also Published As

Publication number Publication date
CN105516027A (en) 2016-04-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Co-patentee after: QAX Technology Group Inc.

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Co-patentee before: BEIJING QIANXIN TECHNOLOGY Co.,Ltd.

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.