CN112822121A

CN112822121A - Traffic identification method, traffic determination method and knowledge graph establishment method

Info

Publication number: CN112822121A
Application number: CN201911121222.8A
Authority: CN
Inventors: 杨治国
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2019-11-15
Filing date: 2019-11-15
Publication date: 2021-05-18

Abstract

The invention provides a traffic identification method, a traffic determination method and a knowledge graph establishment method. The traffic identification method comprises: clustering the messages in target traffic according to the degree of similarity between the messages to obtain one or more traffic clusters; performing feature extraction on each traffic cluster to obtain feature information corresponding to that traffic cluster, the feature information indicating the features of the messages included in the corresponding traffic cluster; and uploading the mapping between the one or more traffic clusters and the corresponding feature information to a Deep Packet Inspection (DPI) module so that the DPI module can perform traffic identification. The invention addresses the problem in the related art that the efficiency and accuracy of traffic identification are easily affected by manual work: it removes the dependence on manual feature analysis and configuration so that traffic is identified automatically, thereby improving the efficiency and accuracy of traffic identification.

Description

Traffic identification method, traffic determination method and knowledge graph establishment method

Technical Field

The invention relates to the field of communication, in particular to a traffic identification method, a traffic determination method and a knowledge graph establishment method.

Background

Traffic identification is the basis for upper-layer services such as charging, monitoring, security control and commercial value analysis, so fast traffic identification with a high identification rate is key to delivering these upper-layer services with high quality.

At present, traffic identification in the related art is implemented by methods based on ports, five-tuples, flow features, Deep Packet Inspection (DPI), deep learning, and the like. Among these, the port-based and five-tuple-based methods are fast, but they are easily affected by factors such as dynamic IP addresses, port changes and port multiplexing, which seriously degrades the identification rate and means that accuracy cannot be guaranteed; they therefore cannot be used in scenarios with strict identification requirements, such as charging and control. The flow-feature-based method uses the physical characteristics of message interaction, is non-invasive to the messages and protects user privacy well; however, flow features are easily affected by the physical network environment, service upgrades and the like, so identification accuracy is low in practice and the method is often used only as an auxiliary means of identification. The deep-learning-based method places high demands on physical equipment and consumes considerable resources, and at present it is difficult for it to cope with high-traffic backbone network scenarios.

Among the above methods, the DPI-based method uses the byte features of the messages. These features are not affected by the physical environment, port changes or service IP changes, so the identification rate can be guaranteed. However, DPI relies on manually configured features to identify messages. As a result, DPI-based traffic identification consumes human resources, and its accuracy depends on the quality of the manually configured features.

No effective solution has yet been proposed in the related art for the problem that the efficiency and accuracy of traffic identification are easily affected by manual work.

Disclosure of Invention

Embodiments of the invention provide a traffic identification method, a traffic determination method and a knowledge graph establishment method, which at least solve the problem in the related art that the efficiency and accuracy of traffic identification are easily affected by manual work.

According to an embodiment of the present invention, there is provided a traffic identification method including:

clustering the messages in target traffic according to the degree of similarity between the messages in the target traffic to obtain one or more traffic clusters;

performing feature extraction on each traffic cluster to obtain feature information corresponding to each traffic cluster, the feature information indicating the features of the messages included in the corresponding traffic cluster;

and uploading the mapping between the one or more traffic clusters and the corresponding feature information to a Deep Packet Inspection (DPI) module so that the DPI module can perform traffic identification.

According to another embodiment of the present invention, there is also provided a traffic determination method including:

acquiring first identification information of each traffic cluster, and determining, according to the first identification information and a preset knowledge graph, the entity objects respectively corresponding to one or more traffic clusters, so as to realize traffic determination;

the knowledge graph is established according to an entity object and second identification information corresponding to the entity object.

According to another embodiment of the present invention, there is also provided a knowledge graph establishing method, including:

acquiring a data packet of an entity object, and establishing an entity relationship of the entity object according to the data packet of the entity object; the entity relationship comprises a mapping relationship between the entity object and second identification information corresponding to the entity object;

and establishing a knowledge graph according to the entity relationship of the entity object.

According to another embodiment of the present invention, there is also provided a traffic identification apparatus including:

a first clustering module, configured to cluster the messages in target traffic according to the degree of similarity between the messages in the target traffic to obtain one or more traffic clusters;

an extraction module, configured to perform feature extraction on each traffic cluster to obtain feature information corresponding to each traffic cluster, the feature information indicating the features of the messages included in the corresponding traffic cluster;

and an identification module, configured to upload the mapping between the one or more traffic clusters and the corresponding feature information to a Deep Packet Inspection (DPI) module so that the DPI module can perform traffic identification.

According to another embodiment of the present invention, there is also provided a traffic determination apparatus including:

a second clustering module, configured to cluster the messages in target traffic according to the degree of similarity between the messages in the target traffic to obtain one or more traffic clusters;

and a determining module, configured to acquire first identification information of each traffic cluster and to determine, according to the first identification information and a preset knowledge graph, the entity objects respectively corresponding to the one or more traffic clusters, so as to realize traffic determination.

According to another embodiment of the present invention, there is also provided a knowledge-graph establishing apparatus including:

the acquisition module is used for acquiring a data packet of an entity object and establishing an entity relationship of the entity object according to the data packet of the entity object; the entity relationship comprises a mapping relationship between the entity object and second identification information corresponding to the entity object;

and the establishing module is used for establishing a knowledge graph according to the entity relationship of the entity object.

According to another embodiment of the present invention, a computer-readable storage medium is also provided, in which a computer program is stored, wherein the computer program is configured to perform the steps of any of the above-described method embodiments when executed.

According to another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.

According to the invention, the messages in the target traffic are clustered according to the degree of similarity between them to obtain one or more traffic clusters; feature extraction is performed on each traffic cluster to obtain feature information that corresponds to the traffic cluster and indicates the features of the messages it includes; and the mapping between the one or more traffic clusters and the corresponding feature information is uploaded to a DPI module so that the DPI module can perform traffic identification. The invention therefore addresses the problem in the related art that the efficiency and accuracy of traffic identification are easily affected by manual work: it removes the dependence on manual feature analysis and configuration so that traffic is identified automatically, which improves the efficiency and accuracy of traffic identification.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

Fig. 1 is a schematic diagram of the DPI-based traffic identification process in the related art;

Fig. 2 is a flowchart of a traffic identification method provided according to an embodiment of the present invention;

Fig. 3 is a schematic flowchart of message analysis under different application layer protocols according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of the conversion between a similarity matrix and an undirected graph provided according to an embodiment of the present invention;

Fig. 5 is a schematic flowchart of clustering target traffic based on the degree of similarity according to an embodiment of the present invention;

Fig. 6 is a schematic flowchart of feature extraction on a traffic cluster according to an embodiment of the present invention;

Fig. 7 is a schematic flowchart of the verification of feature information provided according to an embodiment of the present invention;

Fig. 8 is a schematic flowchart of a traffic identification method in which traffic determination is performed by introducing a knowledge graph according to an embodiment of the present invention;

Fig. 9 is a schematic flowchart of entity object retrieval under different application layer protocols according to an embodiment of the present invention;

Fig. 10 is a flowchart of a traffic determination method provided according to an embodiment of the present invention;

Fig. 11 is a flowchart of a knowledge graph establishment method provided according to an embodiment of the present invention;

Fig. 12 is a schematic flowchart of obtaining a data packet of an entity object according to an embodiment of the present invention;

Fig. 13 is a schematic diagram of entity relationships provided according to an embodiment of the present invention;

Fig. 14 is a block diagram of the structure of a traffic identification apparatus according to an embodiment of the present invention;

Fig. 15 is a block diagram of the structure of a traffic determination apparatus according to an embodiment of the present invention;

Fig. 16 is a block diagram of the structure of a knowledge graph establishment apparatus according to an embodiment of the present invention.

Detailed Description

The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

To further illustrate how the traffic identification method of the present invention differs from the related art, and its beneficial effects, the process of DPI-based traffic identification in the related art is first described below.

Fig. 1 is a schematic diagram of the DPI-based traffic identification process in the related art. As shown in Fig. 1, the application, service or Internet of Things device to which the traffic to be identified may correspond must first be determined, and data packets or messages of that application, service or Internet of Things device are then obtained through manual dial testing. After the data packets or messages are analyzed manually, corresponding features such as server information, domain names and keywords can be extracted according to different configuration modes. After the features are manually verified and corrected, the feature codes can be stored in a DPI feature library so that traffic can be detected and identified online.

It follows that, in DPI-based traffic identification in the related art, traffic can be detected only after the relevant features have been configured in the DPI feature library, and the configuration of those features depends entirely on the manually selected range of applications, services or Internet of Things devices and on the rules adopted in manual feature extraction. As a result, a great deal of manual effort is spent on configuring features before DPI-based traffic identification can begin. In addition, as applications, services and Internet of Things devices become increasingly diverse, a manually selected range can hardly be comprehensive. Furthermore, manually defined feature extraction rules tend to produce features with insufficient expressiveness.

Example 1

The present embodiment provides a traffic identification method. Fig. 2 is a flowchart of the traffic identification method according to an embodiment of the present invention; as shown in Fig. 2, the traffic identification method in this embodiment includes:

S102, clustering the messages in target traffic according to the degree of similarity between the messages in the target traffic to obtain one or more traffic clusters;

S104, performing feature extraction on each traffic cluster to obtain feature information corresponding to each traffic cluster, the feature information indicating the features of the messages included in the corresponding traffic cluster;

S106, uploading the mapping between the one or more traffic clusters and the corresponding feature information to a Deep Packet Inspection (DPI) module so that the DPI module can perform traffic identification.

It should be further noted that the target traffic in step S102 is unknown traffic. That is, before the traffic identification method in this embodiment is executed, no features for traffic identification are configured manually; instead, the features of the target traffic are extracted directly according to steps S102 to S106. The traffic identification method in this embodiment therefore involves essentially no manual intervention and can fully automate traffic identification.

It should be further noted that each of the one or more traffic clusters obtained in step S102 corresponds to one application or Internet of Things device, or to one specific service under an application or Internet of Things device. That is, step S102 clusters, according to the degree of similarity, traffic that may come from different applications or Internet of Things devices, or from different services under one application or Internet of Things device, so as to obtain a traffic cluster for each application, Internet of Things device or service.

It should be further noted that, in step S106, the mapping between the one or more traffic clusters and the corresponding feature information indicates, for each traffic cluster, which feature information belongs to that traffic cluster.

With the traffic identification method in this embodiment, the messages in the target traffic are clustered according to the degree of similarity between them to obtain one or more traffic clusters; feature extraction is performed on each traffic cluster to obtain feature information that corresponds to the traffic cluster and indicates the features of the messages it includes; and the mapping between the one or more traffic clusters and the corresponding feature information is uploaded to a DPI module so that the DPI module can perform traffic identification. The method therefore addresses the problem in the related art that the efficiency and accuracy of traffic identification are easily affected by manual work: it removes the dependence on manual feature analysis and configuration so that traffic is identified automatically, which improves the efficiency and accuracy of traffic identification.

Specifically, the traffic identification method in this embodiment automatically extracts the features of traffic by analyzing unknown traffic on the live network, so manual feature analysis and configuration are not required, which saves labor and time. The method also forms an automatic closed loop from traffic analysis to DPI traffic identification and avoids the poor feature expressiveness associated with manually configured features, so the identification rate of the DPI module improves markedly. In addition, as the number of samples of target traffic grows, the features stored in the DPI module also grow, so the identification rate of the DPI module increases further over time.

It should be further noted that, in step S106 of this embodiment, the DPI module may be configured, after traffic is accessed, to screen the accessed traffic using the mapping between traffic clusters and corresponding feature information stored in the feature library of the DPI module. Specifically, known traffic among the accessed traffic can be identified directly through the corresponding features, while for unknown traffic steps S102 to S106 in this embodiment can be performed to obtain its features.
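The screening of accessed traffic into known and unknown traffic can be illustrated with a minimal Python sketch. It is not the implementation of the DPI module: the representation of the feature library as a dictionary from cluster label to byte signatures, the substring matching rule, and the names `identify` and `screen` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical feature library: traffic cluster label -> byte signatures.
FeatureLibrary = Dict[str, List[bytes]]


def identify(payload: bytes, library: FeatureLibrary) -> Optional[str]:
    """Return the first cluster label with a signature found in the payload, else None."""
    for cluster, signatures in library.items():
        if any(sig in payload for sig in signatures):
            return cluster
    return None


def screen(payloads: List[bytes],
           library: FeatureLibrary) -> Tuple[Dict[str, List[bytes]], List[bytes]]:
    """Split accessed traffic into identified (known) traffic and unknown traffic."""
    known: Dict[str, List[bytes]] = {}
    unknown: List[bytes] = []
    for payload in payloads:
        label = identify(payload, library)
        if label is None:
            unknown.append(payload)  # candidate input for steps S102 to S106
        else:
            known.setdefault(label, []).append(payload)
    return known, unknown
```

In this sketch, the unknown list is what would be handed to the clustering and feature extraction steps, and the features obtained there would be added back to the library.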

It should be further explained that the traffic identification method in this embodiment is applied differently to different types of traffic. For traffic whose application layer protocol is HTTP, the features of the messages are fixed, so the Host, Location and User-Agent fields of the header in the message stream can be extracted directly as the features of this type of traffic. Similarly, for traffic whose application layer protocol is HTTPS, the features of the messages are fixed, so the server_name field of the handshake message can be extracted directly during the TLS handshake as the feature of this type of traffic.

For these two types of traffic, because the features are fixed, the features of the target traffic can be extracted directly and used for identification, that is, without clustering. Alternatively, for convenience of processing, messages with the same features can be clustered together; this corresponds to step S102, in which the messages in the target traffic are clustered according to the degree of similarity between them to obtain one or more traffic clusters. In this case the degree of similarity means having the same features, that is, the traffic is clustered according to identical values of the corresponding fields. Clustering HTTP(S) traffic in this way gathers the traffic of a given application, Internet of Things device or service into one traffic cluster for convenient processing.
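For HTTP traffic, the direct extraction of fixed header fields and the grouping of messages with the same field value can be sketched as follows. This is a simplified illustration under stated assumptions: each message is a complete HTTP header block in a single byte string, grouping uses the Host field only, and the function names `http_features` and `cluster_by_host` are hypothetical. Extracting server_name from a TLS handshake is not shown.

```python
from collections import defaultdict
from typing import Dict, List

# Header fields named in the text, compared case-insensitively.
FEATURE_FIELDS = ("host", "location", "user-agent")


def http_features(raw: bytes) -> Dict[str, str]:
    """Extract Host, Location and User-Agent from the header block of one HTTP message."""
    head = raw.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    features: Dict[str, str] = {}
    for line in head.split("\r\n")[1:]:  # skip the request or status line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() in FEATURE_FIELDS:
            features[name.strip().lower()] = value.strip()
    return features


def cluster_by_host(messages: List[bytes]) -> Dict[str, List[bytes]]:
    """Group messages whose Host field is identical (similarity = same feature)."""
    clusters: Dict[str, List[bytes]] = defaultdict(list)
    for msg in messages:
        host = http_features(msg).get("host")
        if host is not None:
            clusters[host].append(msg)
    return dict(clusters)
```

Messages without a Host field are simply left out of the result here; a fuller treatment would fall back to another field or to the similarity-based clustering.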

For traffic whose application layer protocol is not HTTP(S), the features of the messages are not fixed, so the features of this type of traffic cannot be determined directly. In this case the target traffic must first be clustered according to the degree of similarity so that the corresponding features can then be extracted. The messages of the target traffic in step S102 may be the original messages of this type of traffic, specifically the first 16 interactive messages intercepted from each long flow in the message stream, together with all messages of short flows. Fig. 3 is a schematic flowchart of message analysis under different application layer protocols according to an embodiment of the present invention; it shows how the messages of flows under the different application layer protocols are analyzed.

Optional implementations of the clustering for traffic whose application layer protocol is not HTTP(S) are described below through a number of optional embodiments.
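The sampling rule for non-HTTP(S) traffic (the first 16 interactive messages of each long flow, and every message of a short flow) can be sketched in a few lines. The text does not define the boundary between long and short flows; the sketch assumes that a flow with more than 16 messages is long, and the flow identifiers and the name `sample_flows` are hypothetical.

```python
from typing import Dict, List

MAX_MESSAGES = 16  # from the text: first 16 interactive messages of each long flow


def sample_flows(flows: Dict[str, List[bytes]],
                 limit: int = MAX_MESSAGES) -> Dict[str, List[bytes]]:
    """Keep at most `limit` leading messages per flow; short flows are kept whole."""
    return {flow_id: messages[:limit] for flow_id, messages in flows.items()}
```

Slicing covers both cases at once: a flow with 16 or fewer messages is returned unchanged.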

In an optional embodiment, in the step S102, performing clustering processing on the packets in the target traffic according to the similarity degree between the packets in the target traffic to obtain one or more traffic clusters includes:

classifying the target traffic according to the following objects to obtain one or more triples: server Internet Protocol (IP) address, port, and transport layer protocol;

clustering one or more triples according to the similarity degree between the messages in any two triples to obtain one or more triples sets; wherein each triple set is used to indicate a traffic cluster.

It should be further noted that, in the above optional embodiment, classifying the target traffic by server IP address, port and transport layer protocol means that the target traffic is grouped by these three fields into a plurality of triples. Each triple can be understood as the TCP or UDP payloads extracted from the target traffic under that key, which serve as a sample for subsequent processing.
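Grouping the target traffic into triples can be sketched as below. The `Packet` record and its field names are hypothetical, and skipping packets with an empty payload is an assumption of this sketch rather than something the text requires.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Packet(NamedTuple):
    """Hypothetical record for one captured packet."""
    server_ip: str
    server_port: int
    protocol: str  # transport layer protocol, "TCP" or "UDP"
    payload: bytes  # TCP or UDP payload


Triple = Tuple[str, int, str]


def group_by_triple(packets: List[Packet]) -> Dict[Triple, List[bytes]]:
    """Collect payloads under the key (server IP, port, transport layer protocol)."""
    groups: Dict[Triple, List[bytes]] = defaultdict(list)
    for pkt in packets:
        if pkt.payload:  # assumption: empty payloads carry no features
            groups[(pkt.server_ip, pkt.server_port, pkt.protocol)].append(pkt.payload)
    return dict(groups)
```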

In an optional embodiment, before performing clustering processing on one or more triples according to a similarity degree between packets in any two triples, the method further includes:

performing deduplication on the messages in each triple.

It should be further noted that deduplication avoids the unnecessary computational burden that large numbers of identical messages in a triple, such as heartbeat packets, would otherwise place on the system.
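A minimal sketch of the deduplication step, assuming that two messages are duplicates only when their payload bytes are exactly identical (the text does not specify a looser criterion); the name `deduplicate` is hypothetical.

```python
from typing import List


def deduplicate(payloads: List[bytes]) -> List[bytes]:
    """Drop byte-identical messages, keeping the first occurrence of each in order."""
    return list(dict.fromkeys(payloads))
```

Because dictionaries preserve insertion order, repeated heartbeat messages collapse to a single sample while the order of first appearance is kept.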

In an optional embodiment, the method further includes:

clustering the messages in each triple according to message type to obtain one or more message type groups corresponding to each triple, the one or more message type groups indicating the messages of one or more message types in the triple;

and extracting the same number of sample messages from each message type group, and re-determining the triple corresponding to the one or more message type groups according to the sample messages of the one or more message type groups.

It should be further noted that the above optional embodiment balances the distribution of the different message types within each triple. Taking a social application as an example, the message data it generates may include text data, voice data, video data and so on. These message types may be unevenly distributed in the target traffic, which tends to degrade the clustering result and also lowers the detection rate of subsequent feature extraction. The technical solution therefore further clusters the messages in each triple by message type, so that messages of different types are grouped separately. The same number of sample messages is then extracted from each message type group, and the triple is re-determined from these sample messages, so that the different message types are evenly distributed in the re-determined triple.

For example, suppose triple A includes three types of message data: text data (10 frames), voice data (20 frames) and video data (12 frames). Clustering triple A yields three message type groups, corresponding to the text data, the voice data and the video data respectively. On this basis, 10 frames can be extracted from each message type group, giving 10 frames of text data, 10 frames of voice data and 10 frames of video data, and a new triple is reconstructed from these 30 frames.

The technical solution described in the above optional embodiment can significantly improve the similarity-based clustering in step S102 and also safeguards the detection rate of subsequent feature extraction.
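The balancing in this example can be sketched as follows. The sketch assumes that the message type groups have already been produced by some message type clustering, that the common sample size defaults to the size of the smallest group (10 in the example), and that samples are drawn at random with a fixed seed; the name `balance_groups` is hypothetical.

```python
import random
from typing import Dict, List, Optional


def balance_groups(groups: Dict[str, List[bytes]],
                   per_group: Optional[int] = None,
                   seed: int = 0) -> List[bytes]:
    """Draw the same number of sample messages from every message type group."""
    if not groups:
        return []
    size = min(len(messages) for messages in groups.values())
    if per_group is not None:
        size = min(size, per_group)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    rebuilt: List[bytes] = []
    for name in sorted(groups):
        rebuilt.extend(rng.sample(groups[name], size))
    return rebuilt
```

With groups of 10, 20 and 12 messages, the rebuilt triple holds 30 messages, 10 from each group.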

In an optional embodiment, the clustering the packets in each triple according to the packet type includes:

and clustering the messages in each triple according to a preset message type clustering mode.

It should be further noted that the message type clustering manner may be a latent Dirichlet allocation (LDA) topic model, a clustering algorithm based on word vectors, a clustering algorithm based on word frequency, or another manner. That is, any algorithm or model capable of clustering the messages in each triple by message type is applicable, and the present invention is not limited in this respect.

In an optional embodiment, the clustering, according to the similarity degree between the packets in any two triples, one or more triples to obtain one or more triple sets includes:

acquiring a similarity matrix according to a preset similarity measurement mode, wherein the similarity matrix is used for indicating the similarity degree between messages in any two triples;

converting the similarity matrix into an undirected graph containing weights, wherein the weights are used for indicating similarity values;

and carrying out community division processing on the undirected graph to obtain one or more triple sets.

It should be further noted that, in the above optional embodiment, the similarity matrix may be obtained in the way similar documents are classified in natural language processing. Specifically, the messages in each triple may be regarded as an independent document, and natural language processing techniques are used to cluster the one or more documents (corresponding to the one or more triples) according to their degree of similarity. Building a similarity matrix from document similarity is well known to those skilled in the art of natural language processing and is not described further here.

In the above optional embodiment, the preset similarity measurement mode may be the cosine similarity of word-frequency vectors or the cosine similarity of word vectors, and the present invention is not limited in this respect.

The similarity matrix is a symmetric matrix. It may persist throughout traffic identification and be updated iteratively as the number of triple samples in the target traffic grows. Fig. 4 is a schematic diagram of the conversion between a similarity matrix and an undirected graph; it shows how the similarity matrix is converted into a weighted undirected graph. Performing community division on the undirected graph means dividing the data in the undirected graph with a community division algorithm.
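One way to build such a similarity matrix is sketched below using the cosine similarity of word-frequency vectors. The text does not say what a "word" is for binary messages; treating every byte 2-gram as a word is an assumption of this sketch, as are the function names.

```python
import math
from collections import Counter
from typing import List


def term_frequencies(messages: List[bytes], n: int = 2) -> Counter:
    """Treat one triple as a document whose words are byte n-grams (an assumption)."""
    counts: Counter = Counter()
    for msg in messages:
        counts.update(msg[i:i + n] for i in range(len(msg) - n + 1))
    return counts


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse frequency vectors; 0.0 if either is empty."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def similarity_matrix(triples: List[List[bytes]]) -> List[List[float]]:
    """Symmetric matrix whose entry (i, j) is the similarity of triples i and j."""
    vectors = [term_frequencies(t) for t in triples]
    size = len(vectors)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i, size):
            matrix[i][j] = matrix[j][i] = cosine(vectors[i], vectors[j])
    return matrix
```

Only the upper triangle is computed and then mirrored, which also guarantees that the result is symmetric.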
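The conversion from a similarity matrix to a weighted undirected graph can be sketched as an edge list keyed by node pairs. Representing the graph as a dictionary and omitting edges of weight zero are choices made for this sketch only; the name `matrix_to_graph` is hypothetical.

```python
from typing import Dict, List, Tuple


def matrix_to_graph(matrix: List[List[float]]) -> Dict[Tuple[int, int], float]:
    """Map each pair i < j with non-zero similarity to an edge weighted by that similarity."""
    edges: Dict[Tuple[int, int], float] = {}
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):  # upper triangle only: the matrix is symmetric
            if matrix[i][j] > 0.0:
                edges[(i, j)] = matrix[i][j]
    return edges
```

Each node stands for one triple, so an edge (i, j) with weight w records that triples i and j have similarity w.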

In an optional embodiment, the community partitioning processing on the undirected graph to obtain one or more triple sets includes:

performing community partitioning processing on the undirected graph to determine one or more first boundary values;

the first boundary value is modified to determine one or more second boundary values, and one or more triple sets are determined based on the one or more second boundary values.

It should be further noted that the first boundary value is a boundary value that is divided by a threshold value based on the weight of the undirected graph completely in the community division processing; partitioning based on the first boundary value described above may result in triples belonging to the same application being partitioned into different sets of triples. In this regard, a correction may be made on the basis of the first boundary value to determine a second boundary value; the specific correction process can be analyzed by introducing human factors engineering.
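The effect of correcting a boundary value can be illustrated with a deliberately simple stand-in for community division: edges whose weight reaches the boundary value are kept, and each connected component becomes one triple set. The text does not name a community division algorithm, and modeling a boundary value as a single weight threshold is an assumption of this sketch; the name `divide` is hypothetical.

```python
from typing import Dict, List, Tuple


def divide(num_nodes: int,
           edges: Dict[Tuple[int, int], float],
           boundary: float) -> List[List[int]]:
    """Group nodes joined by edges with weight >= boundary (union-find components)."""
    parent = list(range(num_nodes))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), weight in edges.items():
        if weight >= boundary:
            parent[find(i)] = find(j)

    sets: Dict[int, List[int]] = {}
    for node in range(num_nodes):
        sets.setdefault(find(node), []).append(node)
    return sorted(sets.values())
```

For edges {(0, 1): 0.9, (1, 2): 0.55, (2, 3): 0.1}, a first boundary value of 0.6 separates triple 2 from triples 0 and 1, while a corrected second boundary value of 0.5 places triples 0, 1 and 2 in the same set.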

Fig. 5 is a schematic flow chart of clustering based on the similarity degree of target traffic according to an embodiment of the present invention. The process in step S102 of clustering the messages of target traffic whose application layer protocol is a non-http(s) protocol, according to the similarity degree between those messages, to obtain one or more traffic clusters is shown in fig. 5.

In an optional embodiment, in the step S104, performing feature extraction on each traffic cluster to obtain feature information corresponding to each traffic cluster, includes:

clustering the triples in each flow cluster according to the characteristic types to obtain one or more sub-flow clusters corresponding to each flow cluster;

extracting one or more pieces of sub-feature information in each sub-traffic cluster according to a preset feature extraction mode;

and determining the characteristic information corresponding to the flow clusters according to one or more sub-characteristic information corresponding to one or more sub-flow clusters.

In an optional embodiment, the clustering, according to the characteristic type, the triple in each traffic cluster to obtain one or more sub-traffic clusters corresponding to each traffic cluster includes:

and clustering the triples in each flow cluster according to a preset characteristic type clustering mode.

It should be further noted that the feature type clustering manner may be the Latent Dirichlet Allocation (LDA) topic model, a clustering algorithm based on word vectors, a clustering algorithm based on word frequency, or another manner; that is, any algorithm or model capable of clustering the messages in each triple according to feature type is applicable, and the present invention is not limited thereto.

In an optional embodiment, in the step S104, the sub-feature information includes at least one of: fixed domain features, enumerated domain features, keyword field features, regular expression features, length domain features;

extracting one or more pieces of sub-feature information in each sub-traffic cluster according to a preset feature extraction mode, wherein the extracting step comprises the following steps:

extracting fixed domain features and enumerated domain features of the sub-traffic clusters according to a matrix alignment algorithm, wherein the fixed domain features and the enumerated domain features are used for indicating the features of the messages of the sub-traffic clusters at fixed positions; or,

extracting keyword field features in the sub-traffic clusters according to a common substring algorithm, wherein the keyword field features are used for indicating keywords existing in each message in the sub-traffic clusters; or,

extracting regular expression features in the sub-traffic clusters according to a multi-sequence alignment algorithm, wherein the regular expression features are used for indicating regular expressions of messages in the sub-traffic clusters; or,

and extracting the length domain characteristics of the sub-flow clusters according to the relation of the byte domain lengths of the specified positions, wherein the length domain characteristics are used for indicating the lengths of the messages in the sub-flow clusters.
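Two of the extraction modes listed above can be sketched in a simplified form. The sketch assumes that "matrix alignment" means stacking the messages as rows and comparing them column by column at each byte offset, and it uses a brute-force search for the longest substring shared by all messages; both are illustrative assumptions rather than the algorithms of the invention, and the names `positional_features`, `longest_common_substring` and `max_enum` are hypothetical.

```python
def positional_features(messages, max_enum=3):
    """Classify each byte offset shared by all messages (must be non-empty).

    Returns (fixed, enumerated): fixed maps offset -> the single byte value
    seen there; enumerated maps offset -> sorted list of the few values seen.
    Offsets with more than max_enum distinct values are ignored.
    """
    width = min(len(m) for m in messages)
    fixed, enumerated = {}, {}
    for offset in range(width):
        values = {m[offset] for m in messages}
        if len(values) == 1:
            fixed[offset] = next(iter(values))
        elif len(values) <= max_enum:
            enumerated[offset] = sorted(values)
    return fixed, enumerated


def longest_common_substring(messages):
    """Longest byte string contained in every message (brute force)."""
    shortest = min(messages, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in m for m in messages):
                return candidate
    return b""
```

The brute-force search is adequate for short sampled payloads; a suffix-based structure would be the usual choice for large samples.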

In an optional embodiment, in the step S104, determining the characteristic information corresponding to the traffic cluster according to one or more sub-characteristic information corresponding to one or more sub-traffic clusters includes:

determining characteristic information corresponding to the flow clusters according to one or more sub-characteristic information corresponding to one or more sub-flow clusters; the characteristic information comprises one or more pieces of sub-characteristic information corresponding to at least one sub-traffic cluster.

It should be further noted that, in the foregoing optional embodiment, the feature information includes one or more pieces of sub-feature information corresponding to at least one sub-traffic cluster. That is, for one traffic cluster, the corresponding feature information is obtained by combining, with OR logic, the sub-feature information of the one or more sub-traffic clusters belonging to that traffic cluster; a message that satisfies the sub-feature information of any one sub-traffic cluster is treated as satisfying the feature information of the traffic cluster. Fig. 6 is a schematic flow chart of feature extraction performed on a traffic cluster according to an embodiment of the present invention, and the flow of feature extraction performed on the traffic cluster is shown in fig. 6.
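The OR combination of sub-feature information can be sketched as below. The dictionary layout of a sub-feature (`"fixed"` offsets and `"keywords"`) and the rule that all conditions inside one sub-feature must hold together are assumptions made for the illustration; only the OR across sub-traffic clusters comes from the text.

```python
def matches_sub_feature(payload, sub_feature):
    """True if the payload meets every condition of one sub-feature.

    sub_feature is a hypothetical dict with optional keys:
      "fixed":    {offset: byte value expected at that offset}
      "keywords": [byte strings that must occur somewhere in the payload]
    """
    fixed_ok = all(
        len(payload) > offset and payload[offset] == value
        for offset, value in sub_feature.get("fixed", {}).items()
    )
    keywords_ok = all(k in payload for k in sub_feature.get("keywords", []))
    return fixed_ok and keywords_ok


def matches_cluster(payload, sub_features):
    """OR logic: matching any one sub-feature matches the traffic cluster."""
    return any(matches_sub_feature(payload, sf) for sf in sub_features)
```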

In an optional embodiment, in step S106, uploading the mapping relationship between one or more traffic clusters and corresponding feature information to the deep packet inspection DPI module includes:

and verifying the characteristic information corresponding to one or more flow clusters, and uploading the mapping relation between the verified flow clusters and the corresponding characteristic information to the DPI module.

In an optional embodiment, the verifying the characteristic information corresponding to one or more traffic clusters includes:

performing feature verification on each triple in the flow cluster corresponding to the feature information according to the feature information to determine the detection rate of the feature information;

performing characteristic verification on other traffic clusters and/or known traffic according to the characteristic information to determine the false detection rate of the characteristic information;

and verifying the characteristic information according to the relation between the detection rate and the preset detection threshold value and the relation between the false detection rate and the preset false detection threshold value.

It should be further noted that, in the above alternative embodiment, the feature information is verified from two dimensions, on one hand, each triple in the traffic cluster corresponding to the feature information is verified according to the feature information, so as to determine the detection rate of the feature information, that is, whether the feature information can effectively express the traffic cluster. On the other hand, other traffic clusters and/or known traffic are verified according to the characteristic information to determine the false detection rate of the characteristic information, that is, whether other traffic clusters and/or known traffic can also be expressed through the characteristic information; the other traffic clusters are traffic clusters not corresponding to the characteristic information, and the known traffic indicates a sample traffic with determined characteristics.

In the process of verifying the feature information according to the relationship between the detection rate and the preset detection threshold and the relationship between the false detection rate and the preset false detection threshold, the feature information can be uploaded to a DPI module only when the feature information passes verification in the two dimensions; if the feature information fails to pass the verification, further determination is made as to whether the feature information belongs to overfitting (the detection rate is not satisfactory) or underfitting (the detection rate is satisfactory, but the false detection rate is too high). In case the feature fails to pass the verification, the parameters are revised in the feature extraction stage, i.e. step S104, according to the above over-fitting or under-fitting situation. Fig. 7 is a schematic diagram of a process of verifying the feature information according to an embodiment of the present invention, where the process of verifying the feature information is shown in fig. 7. In fig. 7, a known traffic sample library is known traffic in the above alternative embodiment, and a traffic cluster sampling message corresponds to a message in the above traffic cluster.
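The two-dimensional verification can be sketched as follows. The threshold values 0.95 and 0.01 are placeholders chosen for the example, not values given by the invention, and `verify_feature` is a hypothetical name; the overfit/underfit labels follow the definitions in the paragraph above.

```python
def verify_feature(matches, own_messages, other_messages,
                   min_detection=0.95, max_false_detection=0.01):
    """Check a feature against its own cluster and against other traffic.

    matches: callable taking one message and returning True/False.
    own_messages must be non-empty; other_messages may be empty.
    Returns (verdict, detection_rate, false_detection_rate).
    """
    detection = sum(1 for m in own_messages if matches(m)) / len(own_messages)
    if other_messages:
        false_detection = (
            sum(1 for m in other_messages if matches(m)) / len(other_messages)
        )
    else:
        false_detection = 0.0

    if detection < min_detection:
        verdict = "overfit"    # the feature fails to express its own cluster
    elif false_detection > max_false_detection:
        verdict = "underfit"   # the feature also expresses other traffic
    else:
        verdict = "pass"       # eligible for upload to the DPI module
    return verdict, detection, false_detection
```

A verdict other than "pass" would send the feature back to the extraction stage (step S104) with revised parameters.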

In an optional embodiment, in the steps S102 to S106, the target traffic is traffic of the same client, or the target traffic is traffic of different clients.

In an optional embodiment, in the step S102, after performing clustering processing on the packets in the target traffic according to the similarity degree between the packets in the target traffic to obtain one or more traffic clusters, the method further includes:

the knowledge graph is established according to the entity object and second identification information corresponding to the entity object.

It should be further noted that, in the above alternative embodiment, the preset knowledge graph may be constructed independently of the traffic identification process in the present embodiment. The knowledge graph collects relevant information about the entities corresponding to the target traffic and organizes the relations between different entities into a relation network. As a result, the traffic identification process can not only identify the traffic but also determine the entity object corresponding to the target traffic and any information associated with it.

Establishing the knowledge graph involves, on the one hand, knowledge extraction, namely collection of the second identification information, and, on the other hand, knowledge processing, namely construction of entity relations, that is, modeling for business scenarios. The information for knowledge extraction mainly comes from the following sources: 1) application markets, crawled to obtain application names, descriptions, official websites, development teams, application categories, and the like; 2) filing data of the development team (developer); 3) domain names and record information of websites; 4) information associated with IP addresses; 5) simple dial-test data; 6) association data provided by customers; 7) search websites; 8) information on common internet of things devices. The crawler techniques involved are known to those skilled in the art, and thus the present invention does not describe them in detail herein.

The specific process of constructing the knowledge graph is described in the following embodiment 3, and therefore, the detailed description thereof is omitted.

In the above optional embodiment, the first identification information of the traffic cluster indicates a corresponding identifier in the traffic cluster, and the first identification information may be a server list, a domain name, record information, a plaintext common substring, and the like of the traffic cluster, which is not limited in this disclosure. The second identification information of the entity object indicates an identification corresponding to the entity object, and the second identification information may be a name, a developer, a domain name, and the like of the entity object, which is not limited in the present invention.

In the above optional embodiment, determining the entity objects corresponding to the one or more traffic clusters according to the first identification information and the preset knowledge graph means that the first identification information is retrieved in the knowledge graph; according to the retrieval relationship between the first identification information and the second identification information, the entity object corresponding to the second identification information is associated with the traffic cluster corresponding to the first identification information, whereby the entity objects corresponding to the one or more traffic clusters are determined.

It should be further noted that, in the above alternative embodiment, the determination of the target traffic through the knowledge graph may be performed before the feature information is extracted in the traffic identification method of this embodiment, after the feature information is extracted, before the feature information is uploaded to the DPI module, or after the feature information is uploaded to the DPI module, which is not limited by the present invention. Fig. 8 is a schematic flow chart of a traffic identification method after traffic determination is performed by introducing a knowledge graph according to an embodiment of the present invention, and an implementation manner of the above alternative embodiment is shown in fig. 8.

In an optional embodiment, the determining, according to the first identification information and a preset knowledge graph, entity objects corresponding to one or more traffic clusters respectively includes:

determining second identification information corresponding to the first identification information in the knowledge graph according to the first identification information;

and taking the entity object corresponding to the second identification information as the entity object corresponding to the traffic cluster corresponding to the first identification information.

It should be further noted that, similar to the above clustering manner of the flows corresponding to different application layer protocols, the determining of the second identification information corresponding to the first identification information in the knowledge graph according to the first identification information may also be implemented differently according to the flows corresponding to different application layer protocols.

Specifically, for the traffic whose application layer protocol is http(s), a fixed field in a packet corresponding to the traffic cluster may be directly extracted, where the field is a domain name, and therefore, for the traffic whose application layer protocol is http(s), a domain name address may be directly extracted as the first identification information to perform retrieval in the knowledge graph, for example, using the following search statement: com, the potential entity object can be retrieved.

For the traffic with the non-http(s) application layer protocol, the domain name corresponding to the IP included in the traffic cluster and the filing information of the domain name can be acquired, and a certain amount of original messages can be collected for each server. Then, potential application names are obtained through searching the knowledge graph, and two searching and recommending modes are provided.

First, the server list information is used as the first identification information: the knowledge graph is queried for an application whose IP attribute is an element of the server list of the traffic cluster, and if such an application exists, its name is returned as the entity object of the traffic cluster. If nothing is retrieved, the domain name information is used as the first identification information to query for application names associated with the domain name corresponding to the traffic cluster; if exactly one application is associated, its name is returned as the entity object of the traffic cluster. If several application names or none are associated, the record information is further used as the first identification information to query for the application name associated with the record information corresponding to the traffic cluster, until a unique application name can be determined.

If a unique application name still cannot be determined, the traffic cluster is recorded and finally presented to operation and maintenance personnel for processing.
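The first retrieval mode can be sketched as a cascade over progressively weaker identifiers. For simplicity the sketch holds the knowledge graph as a plain list of application records and requires a unique match at every step, including the server list step; the record keys (`"ips"`, `"domains"`, `"records"`) and the function name `resolve_entity` are hypothetical.

```python
def resolve_entity(cluster, applications):
    """Try server list, then domain names, then record information.

    cluster:      dict with optional keys "servers", "domains", "records".
    applications: list of dicts with "name" and optional "ips", "domains",
                  "records" (a flat stand-in for the knowledge graph).
    Returns the unique application name, or None if the cluster has to be
    handed to operation and maintenance personnel.
    """
    steps = [("servers", "ips"), ("domains", "domains"), ("records", "records")]
    for cluster_key, app_key in steps:
        wanted = set(cluster.get(cluster_key, []))
        names = sorted(
            {app["name"] for app in applications
             if wanted & set(app.get(app_key, []))}
        )
        if len(names) == 1:
            return names[0]
    return None
```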

Second, the plaintext common substrings of each traffic cluster are run through the various query modes of the knowledge graph in turn (for example, applications whose domain name is xxxx.com; applications owned by a developer whose English name is xxxx) to obtain potential application names and thereby determine the entity object of the traffic cluster.

Fig. 9 is a schematic flowchart of entity object retrieval under different application layer protocols according to an embodiment of the present invention. The above process of retrieving and determining entity objects for traffic of different application layer protocols is shown in fig. 9.

In an optional embodiment, the first identification information includes at least one of:

server list, domain name, record information, plaintext public substring.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

The flow identification method in the present embodiment is further described below by way of a specific embodiment:

Specific embodiment 1

In this embodiment, the target traffic is traffic whose application layer protocol is http(s). As described in the above embodiments, the http(s) protocol has fields that can be used directly to identify the application, such as the host, location, and User-Agent fields in the http protocol and the server_name field in the TLS handshake of https, so that the host, location, User-Agent, and server_name fields can be used directly as features for traffic of the http(s) protocol. Therefore, traffic of the http(s) protocol may or may not be subjected to clustering. The following describes the process of identifying and determining the target traffic:

S1, the DPI module identifies traffic of the http(s) protocol in the unknown traffic, extracts the host, location, and User-Agent fields from the interactive messages in the http protocol traffic and stores them in a database, and extracts the server_name field of the TLS handshake from the interactive messages in the https protocol traffic and stores it in the database;

S2, collecting information existing on the internet and constructing a knowledge graph according to the steps for establishing a knowledge graph explained in the above embodiments.

S3, for traffic of the http protocol, a domain name lookup is performed using the host field to query the application directly associated with the domain name. If the query returns nothing, the main domain name is extracted from the host field and the application associated with the main domain name is queried. After the application name is preliminarily determined, the User-Agent field is used to further separate out traffic originating from browsers or to subdivide the traffic by access device.
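The lookup with fallback to the main domain name can be sketched as below. Taking the last two labels of the host as the main domain name is a deliberate simplification assumed for the example (it is wrong for multi-label suffixes such as "co.uk", where a public suffix list would be needed), and the mapping `domain_to_app` is a hypothetical stand-in for the knowledge graph query.

```python
def main_domain(host):
    """Naive main domain: the last two labels of the host, port removed."""
    name = host.split(":")[0].strip().rstrip(".").lower()
    labels = name.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else name


def lookup_application(host, domain_to_app):
    """Query by the full host first, then fall back to the main domain."""
    name = host.split(":")[0].strip().rstrip(".").lower()
    return domain_to_app.get(name) or domain_to_app.get(main_domain(host))
```

The same fallback applies to the server_name field when the traffic is https.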

S4, for the https protocol, because the entire message content is encrypted, the server_name field filled in during the TLS handshake is used directly for the domain name lookup, and the application directly associated with the server_name is queried. If no associated application exists, the main domain name is extracted and the application associated with the main domain name is queried. If multiple associated applications exist, the domain name is automatically searched on a search website, the websites marked as official websites or the content of the top-ranked pages are analyzed, and the analysis result is presented to operation and maintenance personnel for a decision.

S5, once the operation and maintenance personnel make a decision for a given domain name, that data is used to update and improve the knowledge graph.

It should be further noted that, for the traffic of the http(s) protocol, one or more applications or internet of things devices corresponding to the traffic can be determined already through establishing the knowledge graph, and therefore, the above-mentioned related fields of the traffic corresponding to the applications or the internet of things devices can be directly processed as features without clustering.

S6, performing feature verification on the application corresponding to the identified traffic of the http(s) protocol, using the host, User-Agent, or server_name as features, and publishing the features to the online DPI feature library if the verification passes.

Specific embodiment 2

In the embodiment, the non-http(s) flow is used as the target flow; traffic of a non-http(s) protocol is mostly burst-type traffic, for example, internet of things equipment and the like. The following describes the process of identifying and determining the target traffic:

S1, the DPI module identifies target traffic of a non-http(s) protocol, extracts a plurality of triples (server IP, port, application layer protocol) and the TCP/UDP payloads for the traffic, and collects and persists part of the original message data for each triple.

S2, grouping and aggregating the data by triple, and deduplicating the aggregated payload data to prevent a large number of messages of the same type from being collected.
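Steps S1 and S2 can be sketched as a grouping pass that keeps only the first copy of each payload per triple. The input record layout, a tuple of (server IP, port, application layer protocol, payload), and the name `group_and_dedupe` are assumptions made for the example.

```python
from collections import defaultdict


def group_and_dedupe(records):
    """Group payloads by (server IP, port, protocol), dropping exact repeats.

    records: iterable of (server_ip, port, protocol, payload) tuples.
    Returns {triple: [unique payloads in first-seen order]}.
    """
    grouped = defaultdict(list)
    seen = defaultdict(set)
    for server_ip, port, protocol, payload in records:
        triple = (server_ip, port, protocol)
        if payload not in seen[triple]:
            seen[triple].add(payload)
            grouped[triple].append(payload)
    return dict(grouped)
```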

S3, segmenting the payload data in each triple into words using a bigram model (Bi-Gram), calculating the term frequency (TF) of each word in each triple, and normalizing it to [0, 1] according to the term frequency distribution.
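A minimal sketch of step S3 follows, under two assumptions that the text does not fix: a "word" is taken to be a pair of adjacent bytes, and normalization divides each count by the largest count in the triple. The names `byte_bigrams` and `normalized_tf` are hypothetical.

```python
from collections import Counter


def byte_bigrams(payload):
    """Split a payload into overlapping two-byte words."""
    return [payload[i:i + 2] for i in range(len(payload) - 1)]


def normalized_tf(payloads):
    """Term frequency of every bigram in one triple, scaled into (0, 1]."""
    counts = Counter()
    for payload in payloads:
        counts.update(byte_bigrams(payload))
    if not counts:
        return {}
    peak = max(counts.values())
    return {word: count / peak for word, count in counts.items()}
```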

S4, training word vectors on the segmented triple payloads, where each word is encoded by its word vector together with its normalized TF as the final encoding; because this step takes the order of words into account, the result is better than that obtained by using term frequency directly.

S5, regarding the encoded payload data of each triple as a document, measuring the cosine similarity between every two triples, and constructing a similarity matrix.

S6, converting the similarity matrix into an undirected weighted graph, in which each weight is the similarity between two triples.

S7, dividing the undirected weighted graph into communities using a community discovery algorithm, so that all triples are divided into different traffic clusters, each traffic cluster representing an unknown application.
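The text does not name a particular community discovery algorithm. As a deliberately simple stand-in, the sketch below prunes edges whose weight falls below a fixed threshold and treats each remaining connected component as one traffic cluster; a real community detection algorithm would replace this. The name `threshold_communities` and the edge dictionary layout are assumptions.

```python
def threshold_communities(node_count, edges, threshold):
    """Group nodes joined by edges of weight >= threshold (union-find).

    edges: {(i, j): weight} for an undirected graph over nodes 0..node_count-1.
    Returns a sorted list of communities, each a sorted list of node indices.
    """
    parent = list(range(node_count))

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for (i, j), weight in edges.items():
        if weight >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    communities = {}
    for node in range(node_count):
        communities.setdefault(find(node), []).append(node)
    return sorted(communities.values())
```

Pruning with a fixed threshold is what leaves some boundary nodes isolated or misassigned, which is the situation the boundary correction step is meant to address.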

S8, because the community discovery algorithm prunes with a fixed threshold, boundaries may be handled unreasonably; the boundary elements are therefore processed by human factors engineering, and pruned elements that satisfy the human factors criteria are merged back into the community, improving the reasonableness of the clustering.

S9, for each traffic cluster (including samples of multiple triplets), performing binary grammar (Bi-Gram) segmentation, word vector training, and DBSCAN clustering, and splitting the traffic cluster into N different sub-clusters (i.e., sub-traffic clusters in the above embodiment).

S10, extracting the byte features of the message for each subdivided cluster by using the feature extraction method described in the above embodiment.

S11, assigning a unique protocol ID to the features and verifying them; if the verification passes, the features are published to the online DPI feature library, and if the verification fails, the parameters are modified and the features are re-extracted.

S12, collecting information existing in the internet and constructing a knowledge graph using the method for establishing a knowledge graph set forth in the above embodiments.

S13, extracting the plaintext common substrings present in each traffic cluster, and searching the knowledge graph for potential application or device names in combination with the server list contained in the traffic cluster and the domain names corresponding to the server list.

For traffic clusters associated with several application/device names, or with no application or device name, the corresponding information is summarized and presented to the operation and maintenance personnel for a decision; once the operation and maintenance personnel calibrate the result, the knowledge graph is updated.

Example 2

The present embodiment further provides a traffic determination method. Fig. 10 is a flowchart of the traffic determination method provided according to the embodiment of the present invention; as shown in fig. 10, the traffic determination method in the present embodiment includes:

S202, clustering the messages in the target traffic according to the similarity degree of the messages in the target traffic to obtain one or more traffic clusters;

S204, acquiring first identification information of each traffic cluster, and determining entity objects respectively corresponding to one or more traffic clusters according to the first identification information and a preset knowledge graph to realize traffic determination.

It should be further noted that the clustering process on the target traffic in step S202 and the determination on the traffic in step S204 correspond to the clustering process on the target traffic in step S102 in embodiment 1 and the alternative embodiment of the traffic determination in embodiment 1. Therefore, the remaining optional embodiments and technical effects in this embodiment refer to embodiment 1, and are not described herein again.

In an optional embodiment, in the step S202, performing clustering processing on the packets in the target traffic according to the similarity degree between the packets in the target traffic to obtain one or more traffic clusters includes:

and carrying out duplicate removal processing on the message in each triple.

In an optional embodiment, in the step S204, determining, according to the first identification information and a preset knowledge graph, entity objects respectively corresponding to one or more traffic clusters includes:

server list, domain name, record information, plaintext public substring.

Example 3

The embodiment provides a knowledge graph establishment method. Fig. 11 is a flowchart of the knowledge graph establishment method provided in the embodiment of the present invention; as shown in fig. 11, the knowledge graph establishment method in the embodiment includes:

S302, acquiring a data packet of the entity object, and establishing an entity relationship of the entity object according to the data packet of the entity object; the entity relationship comprises a mapping relationship between the entity object and second identification information corresponding to the entity object;

S304, establishing a knowledge graph according to the entity relationship of the entity object.

It should be further noted that, in the foregoing embodiment, the entity object specifically indicates an application or an internet of things device, or a specific service in an application or an internet of things device.

In an optional embodiment, in step S302, the obtaining a data packet of the entity object includes:

and simulating the interactive operation between the entity object and the server to acquire a data packet of the entity object in the starting stage of the server.

It should be further noted that the execution subject in the above optional embodiment may be a virtual machine; that is, the virtual machine is used to run the entity object virtually so as to simulate the interactive operation between the entity object and the server and obtain the data packet of the entity object in the starting stage of the server. Specifically, the interaction between the entity object and the server is simulated through the virtual machine, and the operations of the entity object in the starting stage, such as application installation, starting, packet capturing, and advertisement filtering, are captured during the interaction to obtain the corresponding data packet. Fig. 12 is a schematic flowchart of a process for obtaining an entity object data packet according to an embodiment of the present invention, and the implementation of the foregoing alternative embodiment is shown in fig. 12.

The above method can only obtain brief information about the entity object, such as the IP and port data of the application server; for http(s) messages, field data such as host and server_name can also be extracted from the message. In the related art, the data packet obtained in the above alternative embodiment cannot be identified or determined, because the data packet has not undergone operations such as authentication and registration. In the above optional embodiment, however, the data packet is used to construct the knowledge graph and then works together with the knowledge graph to determine the traffic, so the dial testing of the data packet carried out by the virtual machine allows the traffic to be determined without detailed and tedious operations.

In an optional embodiment, in the step S304, establishing an entity relationship of the entity object according to the data packet of the entity object, includes:

acquiring name information, developer information and domain name information of the entity object according to the data packet of the entity object;

and establishing the entity relationship of the entity object according to the corresponding relationship among the name information, the developer information and the domain name information.

It should be further noted that, in the foregoing optional embodiment, the entity relationship describes the association between traffic and the entities related to that traffic; specifically, it may be represented by subject-predicate-object triples of the form (entity 1, entity relationship, entity 2) or (entity, attribute, attribute value). In the above optional embodiment, the name information of the entity object may include: application name, Chinese name, English name, name in pinyin, category, official website address, description information, and the host field, server_name field, and server IP list obtainable by dial testing; the developer information of the entity object may include: email address, telephone number, industry, official website, filing address, etc.; the domain name information of the entity object may include: main domain name, the Title of the page returned for a domain name request, mapped IP address, etc.

Fig. 13 is a schematic diagram of entity relationships provided according to an embodiment of the present invention. As shown in fig. 13, the entity relationships specifically include relationships between applications/devices and developers. The relationship between a developer and an application/device is labeled Has, indicating that the developer has developed that application/device. The relationship between two applications/devices that have the same developer or the same main domain name, that is, that are developed by the same developer, is labeled as bluetooth. The relationship between an application and a domain name is labeled domain is, representing the domain name address bound to the application.
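The triple representation of entity relationships can be sketched with a small in-memory structure. The labels "Has" and "domain is" are written as they appear in the text above; the class `Triple`, the helper names, and the sample values are hypothetical, and the in-memory list is only a stand-in for a graph database.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One (entity 1, entity relationship, entity 2) statement."""
    subject: str
    predicate: str
    obj: str


def build_entity_relations(app_name, developer, domain):
    """Relations derived from one application's name, developer and domain."""
    return [
        Triple(developer, "Has", app_name),       # developer developed the app
        Triple(app_name, "domain is", domain),    # domain bound to the app
    ]


def objects_of(triples, subject, predicate):
    """All objects linked to a subject by the given relationship label."""
    return [t.obj for t in triples if t.subject == subject and t.predicate == predicate]
```

For example, `objects_of(triples, "ExampleDev", "Has")` lists every application recorded for that developer.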

After the entity relationship is determined, the entity relationship can be written into a Neo4j database, visualization processing is carried out, and the domain knowledge graph is generated for information retrieval and recommendation.

Example 4

The traffic identification device provided in this embodiment is used to implement the foregoing embodiments and preferred embodiments; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the devices described in the embodiments below are preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 14 is a block diagram of a traffic identification device according to an embodiment of the present invention. As shown in Fig. 14, the traffic identification device in this embodiment includes:

a first clustering module 402, configured to perform clustering processing on the packets in the target traffic according to the similarity degree between the packets in the target traffic to obtain one or more traffic clusters;

an extracting module 404, configured to perform feature extraction on each traffic cluster to obtain feature information corresponding to each traffic cluster, wherein the feature information indicates the features of the packets included in the corresponding traffic cluster;

an identifying module 406, configured to upload the mapping relationship between the one or more traffic clusters and the corresponding feature information to a deep packet inspection (DPI) module, so that the DPI module performs traffic identification.

It should be further explained that the other optional embodiments and technical effects of the traffic identification device in this embodiment correspond to those of the traffic identification method in embodiment 1, and are therefore not described again here.

In an optional embodiment, the clustering the packets in the target traffic according to the similarity degree between the packets in the target traffic to obtain one or more traffic clusters includes:

and carrying out duplicate removal processing on the message in each triple.

In an optional embodiment, the extracting the features of each traffic cluster to obtain the feature information corresponding to each traffic cluster includes:

In an optional embodiment, the sub-feature information includes at least one of: fixed domain features, enumerated domain features, keyword field features, regular expression features, length domain features;

In an optional embodiment, the determining the characteristic information corresponding to the traffic cluster according to one or more sub-characteristic information corresponding to one or more sub-traffic clusters includes:

In an optional embodiment, the uploading the mapping relationship between one or more traffic clusters and corresponding feature information to the deep packet inspection DPI module includes:

In an optional embodiment, the target traffic is traffic of the same client, or the target traffic is traffic of different clients.

In an optional embodiment, after the clustering the packets in the target traffic according to the similarity degree between the packets in the target traffic to obtain one or more traffic clusters, the method further includes:

server list, domain name, record information, plaintext public substring.

It should be noted that the above modules may be implemented by software or hardware. For the latter, this may be done in, but is not limited to, the following ways: the modules are all located in the same processor; or the modules are located in different processors in any combination.

Example 5

The present embodiment provides a traffic determination apparatus, which is used to implement the foregoing embodiments and preferred embodiments; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 15 is a block diagram of a traffic determination apparatus according to an embodiment of the present invention. As shown in Fig. 15, the traffic determination apparatus in this embodiment includes:

a second clustering module 502, configured to perform clustering processing on the packets in the target traffic according to the similarity degree between the packets in the target traffic to obtain one or more traffic clusters;

a determining module 504, configured to obtain first identification information of each traffic cluster and to determine, according to the first identification information and a preset knowledge graph, the entity objects corresponding to the one or more traffic clusters, so as to implement traffic identification.
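The operation of the determining module may be illustrated by the following minimal sketch, in which the preset knowledge graph is reduced to a lookup table from identification information to entity objects, and the entity object matched most often is chosen for the cluster. The table `KNOWLEDGE_GRAPH`, the function `identify_cluster`, the majority-vote rule and all sample values are assumptions; the embodiment does not prescribe them.

```python
from collections import Counter

# Hypothetical knowledge graph: (kind of identification info, value) -> entity object.
KNOWLEDGE_GRAPH = {
    ("domain", "example.com"): "ExampleApp",
    ("server_ip", "192.0.2.10"): "ExampleApp",
    ("domain", "example.org"): "OtherApp",
}


def identify_cluster(identification_info, graph=KNOWLEDGE_GRAPH):
    """Return the entity object matched by most items of a cluster, or None if nothing matches."""
    votes = Counter(graph[item] for item in identification_info if item in graph)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# First identification information extracted from one traffic cluster (illustrative).
cluster_info = [
    ("domain", "example.com"),
    ("server_ip", "192.0.2.10"),
    ("domain", "example.org"),
]
entity = identify_cluster(cluster_info)
```

A cluster whose identification information matches nothing in the graph is returned as `None`, leaving it available for feature extraction instead.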

It should be further explained that the other optional embodiments and technical effects of the traffic determination apparatus in this embodiment correspond to those of the traffic determination method in embodiment 2, and are therefore not described again here.

and carrying out duplicate removal processing on the message in each triple.

server list, domain name, record information, plaintext public substring.

Example 6

The present embodiment provides a knowledge graph establishing apparatus, which is used to implement the foregoing embodiments and preferred embodiments; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 16 is a block diagram of a knowledge graph establishing apparatus according to an embodiment of the present invention. As shown in Fig. 16, the knowledge graph establishing apparatus in this embodiment includes:

an obtaining module 602, configured to obtain a data packet of an entity object, and establish an entity relationship of the entity object according to the data packet of the entity object; the entity relationship comprises a mapping relationship between the entity object and second identification information corresponding to the entity object;

the establishing module 604 is configured to establish a knowledge graph according to the entity relationship of the entity object.

It should be further explained that other optional embodiments and technical effects of the knowledge graph establishing apparatus in this embodiment correspond to the knowledge graph establishing method in embodiment 3, and therefore are not described herein again.

In an optional embodiment, the obtaining the data packet of the entity object includes:

In an optional embodiment, the establishing an entity relationship of the entity object according to the data packet of the entity object includes:

Example 7

Embodiments of the present invention also provide a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to perform the steps of any of the above-mentioned method embodiments when executed.

Optionally, in the present embodiment, the computer-readable storage medium may be configured to store a computer program for executing the method steps described in embodiment 1, embodiment 2, and embodiment 3.

Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.

Optionally, in this embodiment, the computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

Example 8

Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.

Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

Optionally, in this embodiment, the processor may be configured to execute, by means of a computer program, the method steps described in embodiment 1, embodiment 2, and embodiment 3.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device. They may be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. Alternatively, they may each be fabricated as individual integrated circuit modules, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A traffic identification method, comprising:

2. The method according to claim 1, wherein the clustering the packets in the target traffic according to the similarity degree between the packets in the target traffic to obtain one or more traffic clusters comprises:

clustering the one or more triples according to the similarity degree between the messages in any two triples to obtain one or more triples sets; wherein each of the triple sets is used to indicate one of the traffic clusters.

3. The method according to claim 2, wherein before clustering the one or more triples according to the similarity between the packets in any two triples, the method further comprises:

and carrying out duplicate removal processing on the message in each triple.

4. The method according to claim 2, wherein before clustering the one or more triples according to the similarity between the packets in any two triples, the method further comprises:

clustering the messages in each triple according to message types to obtain one or more message type groups corresponding to each triple; the one or more packet type groups are used for indicating the packets corresponding to the one or more packet types in the triple group respectively;

extracting the same number of sample messages from each message type group, and re-determining the triples corresponding to the one or more message type groups according to the message samples corresponding to the one or more message type groups.
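The grouping and equal-count sampling recited above may be sketched as follows. The rule used to derive a message type (the first whitespace-delimited token of the payload) is a hypothetical placeholder, as are the function names `message_type` and `balance_triple` and the sample messages.

```python
import random
from collections import defaultdict


def message_type(msg: bytes) -> bytes:
    """Hypothetical type key: the first token of the payload."""
    return msg.split(b" ", 1)[0]


def balance_triple(messages, per_type, seed=0):
    """Group messages by type, then draw the same number of samples from every group."""
    groups = defaultdict(list)
    for m in messages:
        groups[message_type(m)].append(m)
    if not groups:
        return []
    # Cap the sample size at the smallest group so every type contributes equally.
    k = min(per_type, min(len(g) for g in groups.values()))
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    balanced = []
    for key in sorted(groups):
        balanced.extend(rng.sample(groups[key], k))
    return balanced


# Illustrative messages belonging to one triple.
messages = [b"GET /a", b"GET /b", b"GET /c", b"POST /x"]
balanced = balance_triple(messages, per_type=2)
```

Drawing the same number of samples per type prevents a frequent message type from dominating the later similarity computation.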

5. The method of claim 4, wherein the clustering the packets in each of the triples according to packet types comprises:

6. The method of claim 2, wherein the clustering the one or more triples according to the similarity between the packets in any two triples to obtain one or more triple sets comprises:

acquiring a similarity matrix according to a preset similarity measurement mode, wherein the similarity matrix is used for indicating the similarity degree between the messages in any two triples;

and carrying out community division processing on the undirected graph to obtain the one or more triple sets.
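A minimal sketch of these steps is given below under stated assumptions: the similarity measure is Jaccard similarity over byte 2-grams, an undirected edge joins two triples whose similarity reaches a threshold, and connected components are used as a simple stand-in for community division. None of these choices, nor the function names or the threshold value, is prescribed by the claim.

```python
def ngrams(data: bytes, n: int = 2) -> set:
    """Set of all n-byte windows of a payload."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard(a: bytes, b: bytes) -> float:
    """Jaccard similarity of the n-gram sets of two payloads."""
    ga, gb = ngrams(a), ngrams(b)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def similarity_matrix(payloads):
    """Pairwise similarity between the representative payloads of the triples."""
    n = len(payloads)
    return [[jaccard(payloads[i], payloads[j]) for j in range(n)] for i in range(n)]


def communities(matrix, threshold=0.5):
    """Connected components of the thresholded graph (stand-in for community division)."""
    n = len(matrix)
    adj = {i: [j for j in range(n) if j != i and matrix[i][j] >= threshold]
           for i in range(n)}
    seen, result = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        result.append(sorted(component))
    return result


# Illustrative payloads, one per triple.
payloads = [
    b"GET /video/1 HTTP/1.1",
    b"GET /video/2 HTTP/1.1",
    b"\x16\x03\x01\x02\x00\x01",
]
triple_sets = communities(similarity_matrix(payloads))
```

Each returned list of indices corresponds to one triple set, that is, one traffic cluster. A dedicated community detection algorithm could replace the connected-components step without changing the surrounding structure.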

7. The method according to claim 6, wherein the community partitioning the undirected graph to obtain the one or more triple sets comprises:

performing a community partitioning process on the undirected graph to determine one or more first boundary values;

the first boundary value is modified to determine one or more second boundary values, and the one or more triple sets are determined based on the one or more second boundary values.

8. The method according to claim 1, wherein the performing feature extraction on each of the traffic clusters to obtain feature information corresponding to each of the traffic clusters comprises:

clustering the triples in each flow cluster according to the characteristic type to obtain one or more sub-flow clusters corresponding to each flow cluster;

extracting one or more pieces of sub-feature information in each sub-flow cluster according to a preset feature extraction mode;

and determining the characteristic information corresponding to the traffic cluster according to the one or more sub-characteristic information corresponding to the one or more sub-traffic clusters.

9. The method of claim 8, wherein the clustering the triplets in each of the traffic clusters according to a characteristic type to obtain one or more sub-traffic clusters corresponding to each of the traffic clusters comprises:

10. The method of claim 8, wherein the sub-feature information comprises at least one of: fixed domain features, enumerated domain features, keyword field features, regular expression features, length domain features;

the extracting one or more pieces of sub-feature information in each sub-traffic cluster according to a preset feature extraction mode includes:

extracting keyword field features in the sub-traffic cluster according to a common substring algorithm, wherein the keyword field features are used for indicating keywords present in each message in the sub-traffic cluster; or,

extracting regular expression features in the sub-traffic cluster according to a multi-sequence alignment algorithm, wherein the regular expression features are used for indicating regular expressions matched by the messages in the sub-traffic cluster; or,

extracting length domain features of the sub-traffic cluster according to the relationship with the length of the byte domain at a specified position, wherein the length domain features are used for indicating the length of the messages in the sub-traffic cluster.
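Two of the extraction modes recited above may be sketched as follows. The keyword is taken as the longest substring common to every message, found by brute force, and the length domain is detected by testing whether a single byte at a candidate offset equals the number of bytes that follow it. Both interpretations, the function names and the sample messages are assumptions made for illustration.

```python
def longest_common_substring(messages):
    """Longest byte string that occurs in every message (brute force over the shortest one)."""
    if not messages:
        return b""
    shortest = min(messages, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in m for m in messages):
                return candidate
    return b""


def find_length_field(messages, max_offset=4):
    """Return the first offset whose byte equals the count of bytes after it, else None."""
    if not messages:
        return None
    for offset in range(max_offset):
        if all(len(m) > offset and m[offset] == len(m) - offset - 1 for m in messages):
            return offset
    return None


# Illustrative messages of one sub-traffic cluster sharing a keyword.
text_messages = [
    b"\x01\x02User-Agent: demo/1.0\r\n",
    b"\xffUser-Agent: demo/2.1\r\nX",
    b"zzUser-Agent: demo/3\r\n",
]
keyword = longest_common_substring(text_messages)

# Illustrative binary messages whose second byte carries the remaining length.
binary_messages = [b"\xaa\x03abc", b"\xaa\x05hello"]
length_offset = find_length_field(binary_messages)
```

The brute-force search is quadratic in the length of the shortest message, which is acceptable for a sketch; a suffix-based structure would be preferable for large clusters.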

11. The method of claim 8, wherein the determining the characteristic information corresponding to the traffic cluster according to the one or more sub-characteristic information corresponding to the one or more sub-traffic clusters comprises:

determining the characteristic information corresponding to the traffic cluster according to the one or more sub-characteristic information corresponding to the one or more sub-traffic clusters; wherein the feature information includes the one or more sub-feature information corresponding to at least one of the sub-traffic clusters.

12. The method of claim 1, wherein uploading the mapping relationship between the one or more traffic clusters and the corresponding feature information to a Deep Packet Inspection (DPI) module comprises:

and verifying the characteristic information corresponding to the one or more traffic clusters, and uploading the mapping relation between the traffic clusters passing the verification and the corresponding characteristic information to the DPI module.

13. The method of claim 12, wherein the verifying the characteristic information corresponding to the one or more traffic clusters comprises:

performing feature verification on each triple in the traffic cluster corresponding to the feature information according to the feature information to determine the detection rate of the feature information;

and verifying the characteristic information according to the relation between the detection rate and a preset detection threshold value and the relation between the false detection rate and a preset false detection threshold value.
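The verification may be sketched as follows, assuming that the detection rate is the share of the cluster's own messages matched by the feature, that the false detection rate is the share of messages from other clusters matched by it, and that a feature passes when the former is at least one threshold and the latter at most another. The matching rule (substring containment), the function names, the threshold values and the sample data are hypothetical.

```python
def match_rate(feature: bytes, payloads) -> float:
    """Share of payloads that contain the feature; 0.0 for an empty collection."""
    if not payloads:
        return 0.0
    return sum(feature in p for p in payloads) / len(payloads)


def verify_feature(feature, own_cluster, other_clusters,
                   min_detection=0.9, max_false_detection=0.05):
    """Accept a feature only if it detects its own cluster and rarely matches others."""
    detection = match_rate(feature, own_cluster)
    false_detection = match_rate(feature, other_clusters)
    return detection >= min_detection and false_detection <= max_false_detection


# Illustrative payloads.
own = [b"xxKEYxx", b"KEY", b"aKEYb"]
others = [b"nothing", b"else"]

accepted = verify_feature(b"KEY", own, others)
rejected = verify_feature(b"e", own, others)
```

Only features for which the check succeeds would be included in the mapping uploaded to the DPI module.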

14. The method of claim 1, wherein the target traffic is traffic of the same client or the target traffic is traffic of different clients.

15. The method according to any one of claims 1 to 14, wherein after clustering the packets in the target traffic according to the similarity degree between the packets in the target traffic to obtain one or more traffic clusters, the method further comprises:

16. The method according to claim 15, wherein the determining, according to the first identification information and a preset knowledge graph, entity objects respectively corresponding to the one or more traffic clusters comprises:

17. The method of claim 15, wherein the first identification information comprises at least one of:

server list, domain name, record information, plaintext public substring.

18. A traffic determination method, comprising:

19. The method according to claim 18, wherein the clustering the packets in the target traffic according to the similarity degree between the packets in the target traffic to obtain one or more traffic clusters comprises:

20. The method according to claim 19, wherein before clustering the one or more triples according to the similarity between the packets in any two triples, the method further comprises:

and carrying out duplicate removal processing on the message in each triple.

21. The method according to claim 19, wherein before clustering the one or more triples according to the similarity between the packets in any two triples, the method further comprises:

22. The method of claim 21, wherein clustering the packets in each of the triples according to packet type comprises:

23. The method of claim 19, wherein the clustering the one or more triples according to the similarity between the packets in any two triples to obtain one or more triple sets comprises:

24. The method of claim 23, wherein the community partitioning the undirected graph to obtain the one or more sets of triples comprises:

25. The method according to any one of claims 18 to 24, wherein the determining, according to the first identification information and a preset knowledge graph, entity objects respectively corresponding to the one or more traffic clusters comprises:

26. The method of claim 18, wherein the first identification information comprises at least one of:

server list, domain name, record information, plaintext public substring.

27. A knowledge graph establishing method, comprising:

28. The method of claim 27, wherein the obtaining the data packet of the entity object comprises:

and simulating the interactive operation between the entity object and a server to acquire the data packet of the starting stage of the entity object in the server.

29. The method according to claim 27, wherein the establishing the entity relationship of the entity object according to the data packet of the entity object comprises:

30. A traffic identification device, comprising:

31. A traffic determination device, comprising:

32. A knowledge-graph building apparatus, comprising:

33. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to perform the method of any one of claims 1 to 17, 18 to 26, 27 to 29 when executed.

34. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and the processor is configured to execute the computer program to perform the method of any one of claims 1 to 17, 18 to 26, and 27 to 29.