CN111737371B - Data flow detection classification method and device capable of dynamically predicting - Google Patents

Data flow detection classification method and device capable of dynamically predicting

Info

Publication number
CN111737371B
Authority
CN
China
Prior art keywords
training sample
training
data
metadata
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010855720.1A
Other languages
Chinese (zh)
Other versions
CN111737371A (en
Inventor
杨贻宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Feiqi Network Technology Co ltd
Original Assignee
Shanghai Feiqi Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Feiqi Network Technology Co ltd filed Critical Shanghai Feiqi Network Technology Co ltd
Priority to CN202010855720.1A priority Critical patent/CN111737371B/en
Publication of CN111737371A publication Critical patent/CN111737371A/en
Application granted granted Critical
Publication of CN111737371B publication Critical patent/CN111737371B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the application provide a dynamically-predictable data traffic detection and classification method and device. Training samples are obtained by analyzing multi-source heterogeneous big data. The training samples are then split according to label type, and a stream session feature vector and a regression model prediction vector are calculated for each split training sample. Hybrid deep learning on the stream session feature vector and the regression model prediction vector takes the characteristics of multi-source heterogeneous data into account, so that multi-source heterogeneous data streams can be intelligently identified and finely classified, which provides capability support for aggregating big data services and applications and enables deep mining of big data value.

Description

Data flow detection classification method and device capable of dynamically predicting
Technical Field
The application relates to the technical field of data traffic detection and classification, in particular to a data traffic detection and classification method and device capable of dynamically predicting.
Background
Real-time analysis of multi-source heterogeneous mass data is becoming increasingly common. How to intelligently identify and finely classify unknown multi-source data streams, so that big data can be analyzed and processed in real time for subsequent business operations, is a major problem in the field. Conventional schemes can usually perform intelligent identification and fine classification only on data streams from a single source and struggle with data streams from multiple sources, which limits the service scenarios in which they can be used.
Disclosure of Invention
In view of this, an object of the present application is to provide a data traffic detection and classification method and apparatus capable of dynamically predicting, which can perform intelligent identification and fine classification on a multi-source data stream, provide capability support for aggregation of big data services and applications, and implement deep mining of big data value.
According to a first aspect of the present application, there is provided a dynamically-predictable data traffic detection and classification method, applied to a server, the method including:
obtaining multi-source heterogeneous big data for data traffic detection and classification, and analyzing the multi-source heterogeneous big data to obtain training samples;
shunting the training samples according to the label types, and calculating the flow session feature vector and the regression model prediction vector of each shunted training sample;
inputting the stream session feature vector, the regression model prediction vector and the label of each training sample into a data traffic detection classification model for training to obtain a trained target data traffic detection classification model;
and detecting and classifying the multi-source heterogeneous data flow to be classified according to the target data flow detection and classification model.
In a possible implementation manner of the first aspect, the step of obtaining multi-source heterogeneous big data for data flow detection and classification, analyzing the multi-source heterogeneous big data, and obtaining a training sample includes:
acquiring multi-source heterogeneous big data for data flow detection and classification;
modeling a directed weighted graph of the multi-source heterogeneous big data, representing entity attributes through a vertex of the directed weighted graph, representing relationships among the entity attributes through edges of the directed weighted graph, wherein the entity attributes are used for representing metadata objects of each data node in the multi-source heterogeneous big data, the relationships among the entity attributes are used for representing metadata relationships, and each metadata object is used as a data field in a relational database;
taking the metadata objects generated by the directed weighted graph as a metadata object dictionary, and eliminating the predecessor relations and successor relations of each metadata object in the directed weighted graph that are not associated with any candidate metadata relation in the metadata object dictionary, so as to obtain legal metadata relations;
and taking each metadata object generated by the directed weighted graph and a legal metadata relation corresponding to each metadata object as the training sample.
In a possible implementation manner of the first aspect, the step of removing predecessor relations and successor relations of each metadata object in the directed weighted graph, which are not associated with the candidate metadata relations in the metadata object dictionary, to obtain a legal metadata relation includes:
judging whether the predecessor relationship and successor relationship of each metadata object in the directed weighted graph are matched with at least one candidate metadata relationship in the metadata object dictionary;
and when the predecessor relationship and successor relationship of any metadata object are not matched with at least one candidate metadata relationship in the metadata object dictionary, removing the predecessor relationship and successor relationship of the metadata object to obtain a legal metadata relationship.
In a possible implementation manner of the first aspect, the step of calculating the streaming session feature vector and the regression model prediction vector of each shunted training sample includes:
calculating a state transition list of each training sample after shunting, performing space compression on the state transition list of each training sample, dividing the state transition list into a plurality of mutually disjoint subsets, and performing coding operation by using different alphabet recoding aiming at each subset to obtain coding characteristic information of each subset;
combining similar coding feature information in the coding feature information of each subset through a state transition edge corresponding to the label type to obtain a streaming session feature vector of each training sample after shunting;
and performing regression model analysis on the flow session feature vector of each shunted training sample to obtain a regression model prediction vector of each shunted training sample.
In a possible implementation manner of the first aspect, the step of merging similar encoding feature information in the encoding feature information of each subset through a state transition edge corresponding to the label type to obtain a streaming session feature vector of each training sample after being shunted includes:
identifying a state transition matrix in the coding characteristic information of each subset through a state transition edge corresponding to the label type;
acquiring a target state transition matrix with the same state transition parameters, and determining coding characteristic information corresponding to the target state transition matrix as similar coding characteristic information;
and combining the similar coding feature information in the coding feature information of each subset to obtain the streaming session feature vector of each training sample after shunting.
In a possible implementation manner of the first aspect, the step of inputting the streaming session feature vector, the regression model prediction vector, and the label of each training sample into a data traffic detection classification model for training to obtain a trained target data traffic detection classification model includes:
respectively establishing a corresponding first initialization weight and a second initialization weight for the stream session feature vector and the regression model prediction vector of each training sample;
inputting the first initialization weight and the second initialization weight into a data flow detection classification model, training a weak regression operator, and evaluating a training error of a label to which each training sample belongs by using the weak regression operator according to the label type;
selecting a corresponding error coefficient according to the training error to adjust the weak regression operator, and updating the weight distribution in each training sample;
judging whether the updated weight distribution in each training sample meets a training end condition; when the training end condition is not met, iterating the training process until the training end condition is met, and obtaining an output result of the weak regression operator for each training sample, wherein the output result comprises a mean square error value, an inverse signal-to-noise ratio value and a maximum error value;
and updating the network parameters of the data traffic detection classification model according to the mean square error value, the inverse signal-to-noise ratio value and the maximum error value to obtain the trained target data traffic detection classification model.
According to another aspect of the present application, there is also provided a data traffic detection and classification apparatus capable of dynamically predicting, which is applied to a server, the apparatus including:
the acquisition module is used for acquiring multi-source heterogeneous big data for data flow detection and classification, analyzing the multi-source heterogeneous big data and acquiring a training sample;
the flow distribution calculation module is used for distributing the training samples according to the label types and calculating the flow session characteristic vector and the regression model prediction vector of each training sample after distribution;
the training module is used for inputting the streaming session feature vector, the regression model prediction vector and the label type of each training sample into a data traffic detection classification model for training to obtain a trained target data traffic detection classification model;
and the classification module is used for detecting and classifying the multi-source heterogeneous data traffic to be classified according to the target data traffic detection and classification model.
Based on any of the above aspects, training samples are obtained by analyzing multi-source heterogeneous big data, the training samples are then split according to label type, and the stream session feature vector and the regression model prediction vector of each split training sample are calculated. Hybrid deep learning on the stream session feature vector and the regression model prediction vector takes the data characteristics of multi-source heterogeneous data into account, so that multi-source data streams can be intelligently identified and finely classified, capability support is provided for aggregating big data services and applications, and deep mining of big data value is realized.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
FIG. 1 is a schematic diagram illustrating an application scenario of a data traffic detection and classification system provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart illustrating a dynamically-predictable data traffic detection and classification method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a directed weighted graph provided in an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating functional modules of a dynamically-predictable data traffic detection and classification apparatus provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of the components of a server for performing the above dynamically-predictable data traffic detection and classification method according to an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it should be understood that the drawings in the present application are for illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application.
Fig. 1 is a schematic diagram illustrating an application scenario of a data traffic detection and classification system 10 according to an embodiment of the present application. In this embodiment, the data traffic detection and classification system 10 may include a server 100 and a user terminal 200 communicatively connected to the server 100.
User terminal 200 may include, but is not limited to, a mobile device, a tablet computer, a laptop computer, or any combination of two or more thereof.
In other possible embodiments, the data traffic detection and classification system 10 may also include only a portion of the components shown in fig. 1 or may also include other components.
For example, the server 100 may be a single server or a server group. The server group may be centralized or distributed (e.g., the server 100 may be a distributed system).
FIG. 2 is a schematic flow chart of a dynamically-predictable data traffic detection and classification method provided in an embodiment of the present application. In this embodiment, the method may be executed by the server 100 shown in FIG. 1. It should be understood that, in other embodiments, the order of some steps of the method may be interchanged according to actual needs, or some steps may be omitted. The detailed steps of the method are described as follows.
Step S110, obtaining multi-source heterogeneous big data for data traffic detection and classification, and analyzing the multi-source heterogeneous big data to obtain training samples.
Step S120, splitting the training samples according to label type, and calculating the stream session feature vector and the regression model prediction vector of each split training sample.
Step S130, inputting the stream session feature vector, the regression model prediction vector and the label type of each training sample into a data traffic detection classification model for training to obtain a trained target data traffic detection classification model.
Step S140, detecting and classifying the multi-source heterogeneous data traffic to be classified according to the target data traffic detection classification model.
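The splitting in step S120 can be pictured as a plain grouping of labelled samples. The sketch below is a minimal illustration in Python, assuming each training sample is a (features, label) pair; the function name `split_by_label` and the sample values are hypothetical and do not come from the application.

```python
from collections import defaultdict

def split_by_label(samples):
    """Group (features, label) pairs by label type, as in step S120."""
    groups = defaultdict(list)
    for features, label in samples:
        groups[label].append(features)
    return dict(groups)

# Hypothetical samples: feature lists tagged with a service label.
samples = [([0.1, 0.2], "video"), ([0.3, 0.4], "text"), ([0.5, 0.6], "video")]
streams = split_by_label(samples)
```

Each value in `streams` is one split stream of training samples; the feature and prediction vectors of step S120 would then be computed per stream.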
In this embodiment, different service systems are built and deployed at different stages, with different technologies, and under other human factors, so a large amount of service data accumulates under different storage modes, data management modes, and data reading modes. Together these differing storage, management, and reading modes form heterogeneous data sources. In this embodiment, multi-source heterogeneous big data for data traffic detection and classification may first be obtained from these heterogeneous data sources.
The inventor has found that the key characteristics of multi-source heterogeneous big data are captured by the stream session feature vector and the regression model prediction vector. This embodiment therefore obtains training samples by analyzing the multi-source heterogeneous big data, splits the training samples according to label type, and calculates the stream session feature vector and the regression model prediction vector of each split training sample. Hybrid deep learning on the stream session feature vector and the regression model prediction vector takes the data characteristics of multi-source heterogeneous data into account, so that multi-source data streams can be intelligently identified and finely classified, capability support is provided for aggregating big data services and applications, and deep mining of big data value is realized.
In a possible implementation of step S110, note that in a large-scale distributed system data is often distributed across a plurality of data sources, the data sources use different storage modes, and the data in each data source is read, used, updated, maintained, and analyzed by different components and services. That is, data associated with the same entity in a large-scale system is both scattered and heterogeneous. When metadata management is performed for an entity, the metadata related to that entity needs to be collected from a plurality of data sources, and the complexity caused by heterogeneous data sources needs to be overcome. Before features are extracted from the multi-source heterogeneous data for training, the metadata related to all entities that have relationships with one another needs to be collected, and the association relationships among those entities need to be captured.
Because the data is multi-source and heterogeneous, it is difficult for tasks such as data quality management to obtain complete information about the data, to resolve conflicts among the data by using the association relationships among the data, and to ensure the correctness and consistency of the data. During research, the inventor found that, following the traditional approach, a globally unique data source could be established that collects the data of all data sources from the original systems, converts it into a standard form, and stores it. Metadata would then be discovered from this data source, and tasks such as data verification and data cleaning would acquire their data from it. However, in a practical large-scale system the data volume is large (PB or even EB scale is common) and the data grows quickly (daily growth at TB scale is common). A single data source can hardly store data at this scale while also providing access to it, cannot keep up with the rapid data growth, and can hardly offer one representation that covers all heterogeneous forms of the original data. Step S110 is therefore described by way of example below, in which metadata discovery is performed on the multi-source heterogeneous big data to obtain training samples. In detail, step S110 may be embodied by the following exemplary sub-steps S111 to S114.
Sub-step S111, obtaining multi-source heterogeneous big data for data traffic detection and classification.
Sub-step S112, modeling the multi-source heterogeneous big data as a directed weighted graph, representing entity attributes by the vertices of the directed weighted graph, and representing the relationships between the entity attributes by the edges of the directed weighted graph.
Sub-step S113, taking the metadata objects generated by the directed weighted graph as a metadata object dictionary, and eliminating the predecessor relations and successor relations of each metadata object in the directed weighted graph that are not associated with any candidate metadata relation in the metadata object dictionary, so as to obtain legal metadata relations.
Sub-step S114, taking each metadata object generated by the directed weighted graph, together with the legal metadata relations corresponding to that metadata object, as a training sample.
In this embodiment, the entity attributes represent the metadata objects of each data node in the multi-source heterogeneous big data, the relationships between the entity attributes represent metadata relationships, and each metadata object serves as a data field in the database. Referring to FIG. 3, P1 to P11 can each be understood as an entity attribute, and the edges between P1 and P5, P2 and P5, P5 and P7, P7 and P9, P3 and P6, and so on can be understood as the relationships between the corresponding entity attributes. The direction of an arrow represents the direction of the relationship. The relationship between metadata objects may be many-to-many: one metadata object may be processed by different methods to obtain multiple metadata objects, and one metadata object may also be obtained by jointly processing multiple metadata objects. The metadata relationship is unidirectional, i.e., no finite number of processing steps applied to a metadata object yields the original metadata object again.
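The graph model and its unidirectional property can be sketched in a few lines of Python. The following is only an illustration under assumptions: the edges are the five named in the description of FIG. 3, the weights are placeholder values because the application gives none, and the helper names `build_graph` and `is_unidirectional` are hypothetical. The cycle check relies on the standard-library `graphlib` module (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter

# Edges named in the description of FIG. 3, as (source, target, weight).
# The weights are hypothetical placeholders; the source states no values.
EDGES = [
    ("P1", "P5", 1.0),
    ("P2", "P5", 1.0),
    ("P5", "P7", 1.0),
    ("P7", "P9", 1.0),
    ("P3", "P6", 1.0),
]

def build_graph(edges):
    """Return {vertex: {successor: weight}} for a directed weighted graph."""
    graph = {}
    for src, dst, weight in edges:
        graph.setdefault(src, {})[dst] = weight
        graph.setdefault(dst, {})  # make sure sink vertices are present
    return graph

def is_unidirectional(graph):
    """True if no metadata object can be derived from itself (no cycle)."""
    try:
        # prepare() raises CycleError when the graph contains a cycle.
        # Edge direction does not matter for cycle detection.
        TopologicalSorter(graph).prepare()
    except CycleError:
        return False
    return True

graph = build_graph(EDGES)
```

Here `is_unidirectional(graph)` holds for the FIG. 3 edges, and adding a back edge such as P9 to P1 would violate the property.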
In sub-step S113, this embodiment may judge whether the predecessor relationship and successor relationship of each metadata object in the directed weighted graph match at least one candidate metadata relationship in the metadata object dictionary. When the predecessor relationship and successor relationship of any metadata object match no candidate metadata relationship in the metadata object dictionary, they are removed, and the relationships that remain are the legal metadata relationships. For example, if data attribute 1 of metadata object a is processed to obtain data attribute 2, a metadata relationship R exists between data attribute 1 and data attribute 2; data attribute 1 is called the predecessor relationship of metadata object a, and data attribute 2 is called the successor relationship of metadata object a.
In this way, each metadata object generated by the directed weighted graph and the legal metadata relationship corresponding to each metadata object can be used as a training sample, so that metadata of multi-source heterogeneous big data is mined to obtain the training sample.
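The filtering of sub-step S113 and the sample assembly of sub-step S114 might look as follows. This is a sketch under stated assumptions, not the application's implementation: a candidate metadata relation is modelled as a (predecessor, successor) pair, matching means exact equality, and the names `legal_relations` and `training_samples` are hypothetical.

```python
def legal_relations(edges, candidates):
    """Keep only the edges that match a candidate metadata relation."""
    return [(src, dst) for src, dst in edges if (src, dst) in candidates]

def training_samples(edges, candidates):
    """Pair every metadata object with its legal predecessors and successors."""
    samples = {}
    for src, dst in edges:  # register every object, even if all its edges are dropped
        for vertex in (src, dst):
            samples.setdefault(vertex, {"predecessors": [], "successors": []})
    for src, dst in legal_relations(edges, candidates):
        samples[src]["successors"].append(dst)
        samples[dst]["predecessors"].append(src)
    return samples

# Hypothetical input: three edges, of which two match a candidate relation.
edges = [("P1", "P5"), ("P2", "P5"), ("P5", "P7")]
candidates = {("P1", "P5"), ("P5", "P7")}
samples = training_samples(edges, candidates)
```

In this example the edge from P2 to P5 matches no candidate and is removed, so P2 remains as a metadata object with no legal relations.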
In a possible implementation, to train efficiently on the sample traffic and thus quickly perform deep recognition on massive training sample data, step S120 may be implemented by the following exemplary sub-steps S121 to S123.
Sub-step S121, calculating a state transition list of each split training sample, performing space compression on the state transition list of each training sample by dividing it into a plurality of mutually disjoint subsets, and encoding each subset with its own alphabet recoding to obtain the coding feature information of each subset.
In this embodiment, the state transition list may be regarded as a two-dimensional matrix whose rows represent the states of each training sample and whose columns represent the recoded input alphabet; that is, the state transition list represents the mapping between the states of each training sample and the recoded input alphabet. Illustratively, the state of a training sample may be, but is not limited to, its session state, such as the session state each time a stream session is initiated (for example, but not limited to, a video session state, a voice session state, or a text session state).
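One way to read the space compression in sub-step S121 is as per-subset alphabet compression of a transition table. The sketch below rests on that assumption: rows (states) are partitioned into contiguous disjoint subsets, and within each subset the input symbols whose columns are identical share one code. The table values and the names `split_rows` and `recode_alphabet` are hypothetical; the application does not specify the partitioning rule or the code format.

```python
def split_rows(table, subset_size):
    """Partition the rows (states) into mutually disjoint, contiguous subsets."""
    return [table[i:i + subset_size] for i in range(0, len(table), subset_size)]

def recode_alphabet(subset):
    """Recode the input alphabet for one subset of rows.

    Returns (symbol_to_code, compressed): symbols whose columns are identical
    inside this subset get the same code, and `compressed` keeps one column
    per code.
    """
    codes = {}            # column contents -> code, in first-seen order
    symbol_to_code = []
    for column in zip(*subset):
        symbol_to_code.append(codes.setdefault(column, len(codes)))
    compressed = [list(row) for row in zip(*codes)]
    return symbol_to_code, compressed

# Hypothetical 4-state, 4-symbol table: TABLE[state][symbol] -> next state.
TABLE = [
    [1, 1, 2, 2],
    [0, 0, 3, 3],
    [2, 3, 2, 3],
    [1, 3, 1, 3],
]
encoded = [recode_alphabet(subset) for subset in split_rows(TABLE, 2)]
```

A lookup then goes through the subset's own recoding: for state `s` and symbol `a`, with `mapping, rows = encoded[s // 2]`, the next state is `rows[s % 2][mapping[a]]`. Each subset here shrinks from four columns to two.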
Sub-step S122, merging the similar coding feature information in the coding feature information of each subset through the state transition edge corresponding to the label type to obtain the stream session feature vector of each split training sample.
Sub-step S123, performing regression model analysis on the stream session feature vector of each split training sample to obtain the regression model prediction vector of each split training sample.
For example, in sub-step S122, the state transition matrix in the coding feature information of each subset may be identified through the state transition edge corresponding to the label type. Target state transition matrices with the same state transition parameters are then obtained, and the coding feature information corresponding to those target state transition matrices is determined to be similar coding feature information. The similar coding feature information in the coding feature information of each subset can then be merged to obtain the stream session feature vector of each split training sample.
When the state transition matrix is identified in this way, the label type may refer to a label type that a user has annotated in the training sample in advance and that expresses the classification label of the training sample. State transition edges correspond one-to-one to classification labels; each is the edge formed by a state transition start point and a state transition end point of its classification label. For example, the start point and the end point may respectively represent the state service before the transition (e.g., a text service) and the state service after the transition (e.g., a video service) that the classification label allows. Specifically, the nodes in the coding feature information of each subset that match the state transition edge may be found through the start point and end point of the state transition edge corresponding to the label type, and the matched nodes may be arranged into state transition matrices according to the classification of their state services.
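The merging of similar coding feature information can be illustrated with a small grouping routine. The sketch assumes that the state transition matrix of each subset has already been identified, that "the same state transition parameters" means the matrices are equal, and that similar features are merged by an element-wise mean; none of these choices is stated in the application, and the name `merge_similar` and all values are hypothetical.

```python
def merge_similar(subsets):
    """Merge coding features of subsets that share a state transition matrix.

    `subsets` is a list of (matrix, features) pairs, where `matrix` is a tuple
    of tuples (hashable) and `features` is a list of numbers. Groups are
    emitted in first-seen order and concatenated into one feature vector.
    """
    groups = {}
    for matrix, features in subsets:
        groups.setdefault(matrix, []).append(features)
    vector = []
    for members in groups.values():
        # Element-wise mean over the similar feature lists (an assumption).
        vector.extend(sum(column) / len(column) for column in zip(*members))
    return vector

# Hypothetical subsets: the first two share a matrix, the third does not.
text_to_video = ((0, 1), (0, 0))
video_to_text = ((0, 0), (1, 0))
subsets = [
    (text_to_video, [1.0, 3.0]),
    (text_to_video, [3.0, 5.0]),
    (video_to_text, [7.0, 9.0]),
]
feature_vector = merge_similar(subsets)
```

With these inputs the first two subsets collapse into one entry, so the resulting vector has four components instead of six.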
In one possible implementation, step S130 may be embodied by the following exemplary sub-steps S131 to S135, described in detail below.
Sub-step S131, establishing a first initialization weight and a second initialization weight for the stream session feature vector and the regression model prediction vector of each training sample, respectively.
For example, the network units in the data traffic detection classification model that handle the stream session feature vector and the regression model prediction vector may respectively establish the first initialization weight and the second initialization weight according to the vector value of the stream session feature vector and the vector value of the regression model prediction vector of each training sample.
Sub-step S132, inputting the first initialization weight and the second initialization weight into the data traffic detection classification model, training a weak regression operator, and using the weak regression operator to evaluate, according to the label type, the training error of the label type to which each training sample belongs.
Sub-step S133, selecting a corresponding error coefficient according to the training error to adjust the weak regression operator, and updating the weight distribution in each training sample.
Sub-step S134, judging whether the updated weight distribution in each training sample meets a training end condition, and when it does not, iterating the training process until the training end condition is met to obtain the output result of the weak regression operator for each training sample. The output result may include a mean square error value, an inverse signal-to-noise ratio value, and a maximum error value.
Sub-step S135, updating the network parameters of the data traffic detection classification model according to the mean square error value, the inverse signal-to-noise ratio value and the maximum error value to obtain a trained target data traffic detection classification model.
For example, the data traffic detection classification model may be subjected to back propagation training according to a mean square error value, an inverse signal-to-noise ratio value, and a maximum error value in a stochastic gradient descent manner, so as to update network parameters of the data traffic detection classification model.
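The iteration in sub-steps S131 to S134 can be illustrated with a minimal sketch. The text does not specify the weak regression operator, the error coefficient, or the definition of the inverse signal-to-noise ratio, so everything concrete here is an assumption: the weak operator is a weighted least-squares line, the reweighting follows the well-known AdaBoost.R2 scheme, the weights are per-sample weights rather than the per-vector first and second initialization weights, and the inverse signal-to-noise ratio is taken as residual power divided by signal power. The function names are hypothetical.

```python
def fit_weighted_line(xs, ys, weights):
    """Assumed weak regression operator: weighted least-squares line y = a*x + b."""
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    var = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    cov = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    slope = cov / var if var > 0 else 0.0
    return slope, mean_y - slope * mean_x


def train_weak_regressor(xs, ys, max_rounds=10, tol=1e-9):
    """AdaBoost.R2-style reweighting loop (an assumption, not the patented procedure)."""
    n = len(xs)
    weights = [1.0 / n] * n  # initial weight distribution over the samples
    slope, intercept = 0.0, 0.0
    for _ in range(max_rounds):
        slope, intercept = fit_weighted_line(xs, ys, weights)
        errors = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
        worst = max(errors)
        if worst <= tol:  # end condition: the operator already fits every sample
            break
        relative = [e / worst for e in errors]
        avg_loss = sum(w * r for w, r in zip(weights, relative))
        if avg_loss >= 0.5:  # end condition: operator is too weak to keep boosting
            break
        beta = avg_loss / (1.0 - avg_loss)  # error coefficient derived from the training error
        # Well-fitted samples are scaled down, so poorly fitted ones gain relative weight.
        weights = [w * beta ** (1.0 - r) for w, r in zip(weights, relative)]
        total = sum(weights)
        weights = [w / total for w in weights]
    # Output values are computed from the most recently fitted operator.
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    noise_power = sum(r * r for r in residuals)
    signal_power = sum(y * y for y in ys)
    return {
        "mse": noise_power / n,
        "inverse_snr": noise_power / signal_power if signal_power > 0 else float("inf"),
        "max_error": max(abs(r) for r in residuals),
        "weights": weights,
    }
```

The three returned error values correspond to the output result named in sub-step S134; how they are combined when updating the network parameters is left open by the text.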
Based on the same inventive concept, refer to Fig. 4, which is a schematic diagram of the functional modules of the dynamically predictable data traffic detection and classification device 110 provided in an embodiment of the present application. In this embodiment, the data traffic detection and classification device 110 may be divided into functional modules according to the above method embodiment. For example, each function may be assigned its own functional module, or two or more functions may be integrated into one processing module. An integrated module may be implemented in hardware or as a software functional module. It should be noted that the division into modules in the embodiments of the present application is schematic and is only a division by logical function; other divisions are possible in an actual implementation. For example, where each function is assigned its own functional module, the data traffic detection and classification device 110 shown in Fig. 4 is only a schematic diagram of such a device. The data traffic detection and classification device 110 may include an obtaining module 111, a shunting calculation module 112, a training module 113, and a classification module 114, whose functions are described in detail below.
The obtaining module 111 is configured to obtain multi-source heterogeneous big data for data traffic detection and classification, and to analyze the multi-source heterogeneous big data to obtain training samples. It can be understood that the obtaining module 111 may be used to execute step S110; for its detailed implementation, refer to the content related to step S110 above.
The shunting calculation module 112 is configured to shunt the training samples according to label type, and to calculate the flow session feature vector and the regression model prediction vector of each shunted training sample. It can be understood that the shunting calculation module 112 may be used to execute step S120; for its detailed implementation, refer to the content related to step S120 above.
The training module 113 is configured to input the flow session feature vector, the regression model prediction vector, and the label type of each training sample into the data traffic detection classification model for training, so as to obtain a trained target data traffic detection classification model. It can be understood that the training module 113 may be used to execute step S130; for its detailed implementation, refer to the content related to step S130 above.
The classification module 114 is configured to detect and classify the multi-source heterogeneous data traffic to be classified according to the target data traffic detection classification model. It can be understood that the classification module 114 may be used to execute step S140; for its detailed implementation, refer to the content related to step S140 above.
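The division into four modules can be pictured as a small pipeline object. This is only a sketch of the logical division described above: the class name, field names, and callable signatures are hypothetical, and each module is reduced to a plain callable supplied by the caller.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class TrafficClassificationDevice:
    """Hypothetical stand-in for device 110; each field plays the role of one module."""
    obtain: Callable[[Any], List[Any]]                 # obtaining module 111 (step S110)
    shunt_and_compute: Callable[[List[Any]], List[Any]]  # shunting calculation module 112 (step S120)
    train: Callable[[List[Any]], Callable[[Any], str]]   # training module 113 (step S130)
    model: Optional[Callable[[Any], str]] = None        # trained target model, set by fit()

    def fit(self, raw_data: Any) -> None:
        samples = self.obtain(raw_data)
        features = self.shunt_and_compute(samples)
        self.model = self.train(features)

    def classify(self, traffic: Any) -> str:
        """Classification module 114 (step S140): apply the trained target model."""
        if self.model is None:
            raise RuntimeError("fit() must be called before classify()")
        return self.model(traffic)
```

Because the modules are only a logical division, the same object could equally be backed by hardware units or by software functions, as the paragraph above notes.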
In a possible implementation, the obtaining module 111 is specifically configured to:
Obtain multi-source heterogeneous big data for data traffic detection and classification.
Model the multi-source heterogeneous big data as a directed weighted graph, in which the vertices represent entity attributes and the edges represent relationships among the entity attributes. The entity attributes represent the metadata objects of each data node in the multi-source heterogeneous big data, the relationships among the entity attributes represent metadata relationships, and each metadata object serves as a data field in the database.
Take the metadata objects generated from the directed weighted graph as a metadata object dictionary, and remove the predecessor and successor relationships of each metadata object in the directed weighted graph that are not associated with any candidate metadata relationship in the metadata object dictionary, so as to obtain legal metadata relationships.
Take each metadata object generated from the directed weighted graph, together with the legal metadata relationships corresponding to it, as a training sample.
In a possible implementation, the obtaining module 111 is specifically configured to:
Determine whether the predecessor relationship and the successor relationship of each metadata object in the directed weighted graph match at least one candidate metadata relationship in the metadata object dictionary.
When the predecessor relationship and the successor relationship of any metadata object do not match at least one candidate metadata relationship in the metadata object dictionary, remove the predecessor relationship and the successor relationship of that metadata object to obtain the legal metadata relationships.
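The graph modeling and the filtering of predecessor and successor relationships can be sketched as follows. The text does not say how a relationship is matched against a candidate metadata relationship, so this sketch assumes matching by relationship name; the `Edge` type, the field names, and the sample relationship names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One weighted, directed metadata relationship between two metadata objects."""
    src: str        # metadata object at the tail of the edge
    dst: str        # metadata object at the head of the edge
    relation: str   # name of the metadata relationship (assumed matching key)
    weight: float


def legal_relations(edges, candidate_relations):
    """Keep only edges whose relationship matches a candidate metadata relationship."""
    return [e for e in edges if e.relation in candidate_relations]


def build_training_samples(edges, candidate_relations):
    """Pair every metadata object with its legal predecessor and successor relationships."""
    legal = legal_relations(edges, candidate_relations)
    vertices = sorted({v for e in edges for v in (e.src, e.dst)})
    return {
        v: {
            "predecessors": [e for e in legal if e.dst == v],  # edges arriving at v
            "successors": [e for e in legal if e.src == v],    # edges leaving v
        }
        for v in vertices
    }
```

For example, with edges `user_id -> order_id` ("owns"), `order_id -> sku` ("contains"), and `sku -> user_id` ("unknown"), and candidates `{"owns", "contains"}`, the "unknown" edge is dropped and `user_id` ends up with no predecessor relationships.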
In a possible implementation, the shunting calculation module 112 is specifically configured to:
Calculate a state transition list for each shunted training sample, perform space compression on the state transition list of each training sample, divide the state transition list into a plurality of mutually disjoint subsets, and recode each subset with a different alphabet to obtain the coding feature information of each subset.
Merge the similar coding feature information among the coding feature information of the subsets through the state transition edges corresponding to the label type, so as to obtain the flow session feature vector of each shunted training sample.
Perform regression model analysis on the flow session feature vector of each shunted training sample to obtain the regression model prediction vector of each shunted training sample.
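The partitioning and per-subset recoding of a state transition list can be sketched as below. The text does not give the partition criterion or the form of the alphabets, so this sketch assumes that a transition is a `(source_state, symbol, next_state)` triple, that the caller supplies the partition key, and that each subset's alphabet simply maps the symbols occurring in that subset to compact integer codes. The function names are hypothetical.

```python
def partition_transitions(transitions, key):
    """Split a state transition list into mutually disjoint subsets by a caller-chosen key."""
    subsets = {}
    for transition in transitions:
        subsets.setdefault(key(transition), []).append(transition)
    return subsets


def recode_subset(subset):
    """Recode one subset with its own alphabet: symbols seen in the subset map to 0..m-1."""
    symbols = sorted({symbol for _, symbol, _ in subset})
    alphabet = {symbol: code for code, symbol in enumerate(symbols)}
    encoded = [(src, alphabet[symbol], dst) for src, symbol, dst in subset]
    return alphabet, encoded


# Usage with hypothetical protocol symbols, partitioning by source state // 2.
transitions = [(0, "SYN", 1), (1, "ACK", 2), (2, "DATA", 2), (2, "FIN", 3)]
subsets = partition_transitions(transitions, key=lambda t: t[0] // 2)
coded = {name: recode_subset(subset) for name, subset in subsets.items()}
```

Since every subset only needs codes for the symbols it actually contains, each per-subset alphabet is no larger than the global one, which is where the space saving in this sketch comes from.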
In a possible implementation, the shunting calculation module 112 is specifically configured to:
Identify the state transition matrix in the coding feature information of each subset through the state transition edges corresponding to the label type.
Obtain the target state transition matrices that have the same state transition parameters, and determine the coding feature information corresponding to the target state transition matrices as similar coding feature information.
Merge the similar coding feature information among the coding feature information of the subsets to obtain the flow session feature vector of each shunted training sample.
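One way to read the merging step is to group subsets whose state transition matrices are identical and emit each distinct matrix once. The text does not define the state transition parameters or the layout of the resulting vector, so this sketch assumes that the matrix is a table of transition counts and that the feature vector is the concatenation of each distinct matrix followed by the number of subsets that share it. All names are hypothetical.

```python
from collections import defaultdict


def transition_matrix(encoded, n_states):
    """Count matrix of a subset: cell [i][j] is the number of transitions from state i to j."""
    matrix = [[0] * n_states for _ in range(n_states)]
    for src, _, dst in encoded:
        matrix[src][dst] += 1
    return tuple(tuple(row) for row in matrix)  # hashable, so it can be used as a group key


def session_feature_vector(subsets, n_states):
    """Merge subsets with identical matrices and flatten the result into one vector."""
    groups = defaultdict(list)
    for name in sorted(subsets):
        groups[transition_matrix(subsets[name], n_states)].append(name)
    vector = []
    for matrix in sorted(groups):  # sorted for a deterministic layout
        vector.extend(cell for row in matrix for cell in row)
        vector.append(len(groups[matrix]))  # how many subsets were merged into this entry
    return dict(groups), vector
```

With subsets `a` and `b` both containing a single transition from state 0 to state 1, the two are treated as similar and contribute one shared entry with a merge count of 2.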
In a possible implementation, the training module 113 is specifically configured to:
Establish a corresponding first initialization weight and second initialization weight for the flow session feature vector and the regression model prediction vector of each training sample, respectively.
Input the first initialization weight and the second initialization weight into the data traffic detection classification model, train a weak regression operator, and use the weak regression operator to evaluate, according to the label type, the training error for the label type to which each training sample belongs.
Select a corresponding error coefficient according to the training error to adjust the weak regression operator, and update the weight distribution in each training sample.
Judge whether the updated weight distribution in each training sample satisfies a training end condition; when it does not, iterate the training process until the training end condition is satisfied, and obtain the output result of the weak regression operator for each training sample, where the output result includes a mean square error value, an inverse signal-to-noise ratio value, and a maximum error value.
Update the network parameters of the data traffic detection classification model according to the mean square error value, the inverse signal-to-noise ratio value, and the maximum error value to obtain the trained target data traffic detection classification model.
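The final parameter update from the three error values can be sketched with a one-parameter model. The text does not say how the mean square error value, the inverse signal-to-noise ratio value, and the maximum error value are combined, so this sketch assumes a weighted sum with hypothetical coefficients `alpha` and `beta`, a scalar linear model `y = w * x` in place of the network, and an inverse signal-to-noise ratio defined as residual power over signal power.

```python
import random


def combined_loss_and_grad(w, xs, ys, alpha=0.1, beta=0.1):
    """Loss = MSE + alpha * inverse SNR + beta * max error, with its (sub)gradient in w."""
    n = len(xs)
    residuals = [w * x - y for x, y in zip(xs, ys)]
    signal_power = sum(y * y for y in ys) or 1.0  # guard against an all-zero batch
    squared = sum(r * r for r in residuals)
    worst = max(range(n), key=lambda i: abs(residuals[i]))  # index of the largest error
    loss = squared / n + alpha * squared / signal_power + beta * abs(residuals[worst])
    grad_squared = 2.0 * sum(r * x for r, x in zip(residuals, xs))
    sign = (residuals[worst] > 0) - (residuals[worst] < 0)
    grad = grad_squared / n + alpha * grad_squared / signal_power + beta * sign * xs[worst]
    return loss, grad


def sgd(xs, ys, lr=0.05, epochs=200, batch_size=2, seed=0):
    """Mini-batch stochastic gradient descent on the combined loss."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            _, grad = combined_loss_and_grad(w, [xs[i] for i in batch], [ys[i] for i in batch])
            w -= lr * grad
    return w
```

The maximum-error term is not differentiable everywhere, so the sketch uses a subgradient; with a fixed learning rate the parameter settles in a small neighborhood of the optimum rather than exactly on it.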
Based on the same inventive concept, refer to Fig. 5, which is a schematic structural block diagram of a server 100 for executing the above dynamically predictable data traffic detection and classification method, provided in an embodiment of the present application. The server 100 may include a dynamically predictable data traffic detection and classification device 110, a machine-readable storage medium 120, and a processor 130.
In this embodiment, the machine-readable storage medium 120 and the processor 130 are both located in the server 100 and are provided separately from each other. However, it should be understood that the machine-readable storage medium 120 may also be independent of the server 100 and be accessed by the processor 130 through a bus interface. Alternatively, the machine-readable storage medium 120 may be integrated into the processor 130, for example as a cache and/or general-purpose registers.
The dynamically predictable data traffic detection and classification device 110 may include software functional modules (such as the obtaining module 111, the shunting calculation module 112, the training module 113, and the classification module 114 shown in Fig. 4) stored in the machine-readable storage medium 120. When the processor 130 executes these software functional modules, the dynamically predictable data traffic detection and classification method provided by the foregoing method embodiment is implemented.
Since the server 100 provided in this embodiment of the present application is another implementation form of the above method embodiment, and the server 100 can be used to execute the dynamically predictable data traffic detection and classification method provided by that method embodiment, the technical effects it can achieve are described in the method embodiment and are not repeated here.
The above describes only various embodiments of the present application, and the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (6)

1. A dynamically predictable data traffic detection and classification method, applied to a server, the method comprising the following steps:
obtaining multi-source heterogeneous big data for data traffic detection and classification, and analyzing the multi-source heterogeneous big data to obtain training samples;
shunting the training samples according to the label types, and calculating the flow session feature vector and the regression model prediction vector of each shunted training sample;
inputting the flow session feature vector, the regression model prediction vector and the label type of each training sample into a data traffic detection classification model for training to obtain a trained target data traffic detection classification model;
detecting and classifying the multi-source heterogeneous data traffic to be classified according to the target data traffic detection and classification model;
wherein the step of obtaining multi-source heterogeneous big data for data traffic detection and classification and analyzing the multi-source heterogeneous big data to obtain training samples comprises:
acquiring multi-source heterogeneous big data for data flow detection and classification;
modeling the multi-source heterogeneous big data as a directed weighted graph, representing entity attributes through the vertices of the directed weighted graph, and representing relationships among the entity attributes through the edges of the directed weighted graph, wherein the entity attributes are used for representing metadata objects of each data node in the multi-source heterogeneous big data, the relationships among the entity attributes are used for representing metadata relationships, and each metadata object is used as a data field in a database;
taking the metadata objects generated from the directed weighted graph as a metadata object dictionary, and removing the predecessor relationships and successor relationships of each metadata object in the directed weighted graph that are not associated with a candidate metadata relationship in the metadata object dictionary, so as to obtain legal metadata relationships;
taking each metadata object generated by the directed weighted graph and a legal metadata relationship corresponding to each metadata object as the training sample;
wherein the step of calculating the flow session feature vector and the regression model prediction vector of each shunted training sample comprises:
calculating a state transition list of each shunted training sample, performing space compression on the state transition list of each training sample, dividing the state transition list into a plurality of mutually disjoint subsets, and recoding each subset with a different alphabet to obtain coding feature information of each subset;
merging similar coding feature information in the coding feature information of each subset through a state transition edge corresponding to the label type to obtain the flow session feature vector of each shunted training sample;
and performing regression model analysis on the flow session feature vector of each shunted training sample to obtain a regression model prediction vector of each shunted training sample.
2. The method according to claim 1, wherein the step of removing predecessor and successor relationships of each metadata object in the directed weighted graph that are not associated with a candidate metadata relationship in the metadata object dictionary to obtain a legal metadata relationship comprises:
judging whether the predecessor relationship and successor relationship of each metadata object in the directed weighted graph are matched with at least one candidate metadata relationship in the metadata object dictionary;
and when the predecessor relationship and successor relationship of any metadata object are not matched with at least one candidate metadata relationship in the metadata object dictionary, removing the predecessor relationship and successor relationship of the metadata object to obtain a legal metadata relationship.
3. The dynamically predictable data traffic detection and classification method according to claim 1, wherein the step of merging similar coding feature information in the coding feature information of each subset through a state transition edge corresponding to the label type to obtain the flow session feature vector of each shunted training sample comprises:
identifying a state transition matrix in the coding feature information of each subset through the state transition edge corresponding to the label type;
obtaining target state transition matrices with the same state transition parameters, and determining the coding feature information corresponding to the target state transition matrices as similar coding feature information;
and combining the similar coding feature information in the coding feature information of each subset to obtain the streaming session feature vector of each training sample after shunting.
4. The dynamically predictable data traffic detection and classification method according to any one of claims 1-3, wherein the step of inputting the flow session feature vector, the regression model prediction vector and the label type of each training sample into the data traffic detection classification model for training to obtain the trained target data traffic detection classification model comprises:
respectively establishing a corresponding first initialization weight and second initialization weight for the flow session feature vector and the regression model prediction vector of each training sample;
inputting the first initialization weight and the second initialization weight into the data traffic detection classification model, training a weak regression operator, and using the weak regression operator to evaluate, according to the label type, a training error for the label type to which each training sample belongs;
selecting a corresponding error coefficient according to the training error to adjust the weak regression operator, and updating the weight distribution in each training sample;
judging whether the updated weight distribution in each training sample meets a training end condition, when the training end condition is not met, iterating the training process until the training end condition is met, and obtaining an output result of the weak regression operator for each training sample, wherein the output result comprises a mean square error value, an inverse signal-to-noise ratio value and a maximum error value;
and updating the network parameters of the data traffic detection classification model according to the mean square error value, the inverse signal-to-noise ratio value and the maximum error value to obtain the trained target data traffic detection classification model.
5. A dynamically predictable data traffic detection and classification device, applied to a server, the device comprising:
an obtaining module, configured to obtain multi-source heterogeneous big data for data traffic detection and classification, and to analyze the multi-source heterogeneous big data to obtain training samples;
a shunting calculation module, configured to shunt the training samples according to label type, and to calculate the flow session feature vector and the regression model prediction vector of each shunted training sample;
a training module, configured to input the flow session feature vector, the regression model prediction vector and the label type of each training sample into a data traffic detection classification model for training to obtain a trained target data traffic detection classification model;
a classification module, configured to detect and classify the multi-source heterogeneous data traffic to be classified according to the target data traffic detection classification model;
the acquisition module is specifically configured to:
acquiring multi-source heterogeneous big data for data flow detection and classification;
modeling the multi-source heterogeneous big data as a directed weighted graph, representing entity attributes through the vertices of the directed weighted graph, and representing relationships among the entity attributes through the edges of the directed weighted graph, wherein the entity attributes are used for representing metadata objects of each data node in the multi-source heterogeneous big data, the relationships among the entity attributes are used for representing metadata relationships, and each metadata object is used as a data field in a database;
taking the metadata objects generated from the directed weighted graph as a metadata object dictionary, and removing the predecessor relationships and successor relationships of each metadata object in the directed weighted graph that are not associated with a candidate metadata relationship in the metadata object dictionary, so as to obtain legal metadata relationships;
taking each metadata object generated by the directed weighted graph and a legal metadata relationship corresponding to each metadata object as the training sample;
the shunting calculation module is specifically configured to:
calculating a state transition list of each shunted training sample, performing space compression on the state transition list of each training sample, dividing the state transition list into a plurality of mutually disjoint subsets, and recoding each subset with a different alphabet to obtain coding feature information of each subset;
merging similar coding feature information in the coding feature information of each subset through a state transition edge corresponding to the label type to obtain the flow session feature vector of each shunted training sample;
and performing regression model analysis on the flow session feature vector of each shunted training sample to obtain a regression model prediction vector of each shunted training sample.
6. The dynamically predictable data traffic detection and classification device according to claim 5, wherein the training module is specifically configured to:
respectively establishing a corresponding first initialization weight and second initialization weight for the flow session feature vector and the regression model prediction vector of each training sample;
inputting the first initialization weight and the second initialization weight into the data traffic detection classification model, training a weak regression operator, and using the weak regression operator to evaluate, according to the label type, a training error for the label type to which each training sample belongs;
selecting a corresponding error coefficient according to the training error to adjust the weak regression operator, and updating the weight distribution in each training sample;
judging whether the updated weight distribution in each training sample meets a training end condition, when the training end condition is not met, iterating the training process until the training end condition is met, and obtaining an output result of the weak regression operator for each training sample, wherein the output result comprises a mean square error value, an inverse signal-to-noise ratio value and a maximum error value;
and updating the network parameters of the data traffic detection classification model according to the mean square error value, the inverse signal-to-noise ratio value and the maximum error value to obtain the trained target data traffic detection classification model.
CN202010855720.1A 2020-08-24 2020-08-24 Data flow detection classification method and device capable of dynamically predicting Active CN111737371B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010855720.1A CN111737371B (en) 2020-08-24 2020-08-24 Data flow detection classification method and device capable of dynamically predicting

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010855720.1A CN111737371B (en) 2020-08-24 2020-08-24 Data flow detection classification method and device capable of dynamically predicting

Publications (2)

Publication Number Publication Date
CN111737371A CN111737371A (en) 2020-10-02
CN111737371B true CN111737371B (en) 2020-11-13

Family

ID=72658710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010855720.1A Active CN111737371B (en) 2020-08-24 2020-08-24 Data flow detection classification method and device capable of dynamically predicting

Country Status (1)

Country Link
CN (1) CN111737371B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114666398B (en) * 2020-12-07 2024-02-23 深信服科技股份有限公司 Application classification method, device, equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105379204B (en) * 2014-01-14 2019-04-05 华为技术有限公司 Method and system for the resource for selecting data to route
US10043038B2 (en) * 2015-01-08 2018-08-07 Jumpshot, Inc. Identifying private information from data streams
EP3602297B1 (en) * 2017-03-29 2023-03-22 AB Initio Technology LLC Systems and methods for performing data processing operations using variable level parallelism
CN108062551A (en) * 2017-06-28 2018-05-22 浙江大学 A kind of figure Feature Extraction System based on adjacency matrix, figure categorizing system and method
CN111130942B (en) * 2019-12-27 2021-09-14 国网山西省电力公司信息通信分公司 Application flow identification method based on message size analysis
CN111563560B (en) * 2020-05-19 2023-05-30 上海飞旗网络技术股份有限公司 Data stream classification method and device based on time sequence feature learning

Also Published As

Publication number Publication date
CN111737371A (en) 2020-10-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant