CN112788018B

CN112788018B - Multi-protocol data acquisition method, system, device and computer readable storage medium

Info

Publication number: CN112788018B
Application number: CN202011625185.7A
Authority: CN
Inventors: 冼振; 丁成; 尹运良
Original assignee: SHENZHEN TECHRISE ELECTRONICS CO Ltd
Current assignee: SHENZHEN TECHRISE ELECTRONICS CO Ltd
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2022-12-27
Anticipated expiration: 2040-12-30
Also published as: CN112788018A

Abstract

The application relates to a multi-protocol data acquisition method, a system, a device and a computer readable storage medium, wherein the multi-protocol data acquisition method comprises the following steps: identifying a data transmission protocol corresponding to the data transmission channel by calculating a characteristic value; binding the data transmission channel with a corresponding data transmission protocol; and analyzing the acquired data according to the message format of the protocol. The method and the device have the effect of improving the protocol identification speed.

Description

Multi-protocol data acquisition method, system, device and computer readable storage medium

Technical Field

The present application relates to the field of multi-protocol data acquisition technologies, and in particular, to a multi-protocol data acquisition method, system, device, and computer-readable storage medium.

Background

In recent years, with the development of global informatization, a large number of network services and network protocols have come into use in modern computer networks. As network service functions continue to expand, the timeliness requirements for data protocol identification keep rising, and automatic protocol reverse analysis has become a widely pursued goal.

At present, a protocol can generally be identified by means of a port number, a header, or a protocol identifier. For example, the following packet can be identified as a 645-protocol packet from its first byte 68 and its last byte 16: 681111111111116803103234B33A343333333733343334333412E716.
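The header-based check described above can be sketched as follows. This is only an illustration of the conventional approach from the background, not part of the claimed method; the function name `looks_like_645_frame` is hypothetical, and the check covers only the first and last bytes mentioned in the text.

```python
def looks_like_645_frame(hex_packet: str) -> bool:
    """Return True if the hex string starts with byte 68 and ends with byte 16."""
    text = hex_packet.strip().upper()
    # Two hex digits per byte; a frame needs at least a start and an end byte.
    return len(text) >= 4 and text.startswith("68") and text.endswith("16")


# The sample packet quoted in the background section.
packet = "681111111111116803103234B33A343333333733343334333412E716"
is_645 = looks_like_645_frame(packet)
```

A check of this kind is fast but only works for protocols whose framing bytes are already known, which is the limitation the application sets out to address.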

Regarding the related technologies, the inventors consider that reverse analysis of the protocol takes a long time, resulting in a slow data acquisition speed.

Disclosure of Invention

In order to avoid the problem of long time consumption of reverse protocol analysis, the application provides a multi-protocol data acquisition method, a multi-protocol data acquisition system, a multi-protocol data acquisition device and a computer-readable storage medium.

In a first aspect, the present application provides a multi-protocol data acquisition method, which adopts the following technical scheme:

a multi-protocol data acquisition method, comprising:

identifying a data transmission protocol corresponding to the data transmission channel by calculating a characteristic value;

binding the data transmission channel with a corresponding data transmission protocol;

and analyzing the acquired data according to the message format of the protocol.

By adopting the technical scheme, the data transmission protocol corresponding to the data transmission channel is identified by calculating the characteristic value, so that the time of reverse analysis of the protocol can be effectively shortened, and the data acquisition speed is further improved.

Preferably, identifying the data transmission protocol corresponding to the data transmission channel by calculating a characteristic value includes:

collecting any two sections of protocol messages in a data transmission channel;

calculating the characteristic values corresponding to the two sections of protocol messages;

and comparing the characteristic value with the characteristic value of the known protocol to identify the data transmission protocol corresponding to the data transmission channel.

By adopting the technical scheme, the protocol type of a TCP flow can be identified rapidly and accurately. Compared with other algorithms, the protocol identification method can rapidly and accurately identify any protocol while remaining usable, and can run automatically in real time based on the developed protocol identification model.

Preferably, the calculating the characteristic values corresponding to the two segments of protocol packets includes:

any two sections of protocol messages in the collected data transmission channel are used as input and are processed by utilizing an activation function, and characteristic values corresponding to the two sections of protocol messages are obtained.

By adopting the scheme, the unlimited input is converted into the output in a predictable form, and a foundation is laid for the rapid identification of the subsequent protocol.

Preferably, when the calculated characteristic value is compared with the characteristic values of the known protocols and matches none of them, the weight and the bias are updated, the updated protocol identification model is obtained, and the corresponding characteristic value is recalculated.

By adopting the technical scheme, the weight coefficient and the offset value can be automatically calculated according to the recognition effect in the protocol recognition process, and are updated and applied to a new round of recognition, so that the accuracy of binary protocol recognition can be greatly improved. In addition, the method adopts an artificial neural network algorithm to identify and correct the binary protocol, automatically selects and extracts the protocol characteristic value by utilizing the similarity of network flow data and an image, directly uses the network flow data as the input of the neural network to supervise and learn, trains a network flow protocol identification model, and continuously updates and iterates the model by the method, so that the obtained protocol identification model can be optimized and operated in real time.

Preferably, the weights are updated in the following manner:

$$w_{t+1} = w_t - \eta \frac{\partial C}{\partial w}$$

wherein $w_{t+1}$ represents the updated weight, $w_t$ represents the weight before the update, $\eta$ is a constant, $C$ represents the loss function, and $\frac{\partial C}{\partial w}$ represents the partial derivative of the loss function with respect to the weight.

The method updates the weight of the protocol identification model, thereby realizing automatic calculation of the characteristic value of the protocol and enabling the protocol identification to be more accurate and rapid.

Preferably, the bias is updated in the following manner:

$$b_{t+1} = b_t - \eta \frac{\partial C}{\partial b}$$

In the formula, $b_{t+1}$ represents the updated bias, $b_t$ represents the bias before the update, $\eta$ is a constant, and $\frac{\partial C}{\partial b}$ represents the partial derivative of the loss function with respect to the bias; $n$ is the number of times the loss function is recursively calculated (so as to minimize the loss function until the desired expected value is reached); $e_i$ represents, for input $x$, weight $w_i$ and bias $b_i$, the difference between the ideal output and the actual output of each layer, and the value of $e_i$ is calculated from the current $w_i$ and $b_i$.

By adopting the technical scheme, the automatic calculation and updating of the characteristic value can be realized, and the more accurate characteristic value is finally obtained, so that the protocol identification is faster and more accurate.

Preferably, the value range of η is greater than or equal to 0 and less than or equal to 1.

By adopting the technical scheme, the calculated characteristic value is stable and convenient to calculate.

Preferably, the activation function adopts a sigmoid function.

By adopting the technical scheme, the output of the sigmoid function is limited to the range (0,1), so the optimization is stable and the function can be used as an output layer; the sigmoid function is also continuous, which makes it convenient to differentiate.

Preferably, the method further comprises: periodically scanning messages that failed to decode, and re-identifying the data transmission protocol corresponding to the data transmission channel by using the updated protocol identification model. In this way, messages that failed to decode can be corrected periodically.

In a second aspect, the present application provides a multi-protocol data acquisition system, which adopts the following technical solutions:

a multi-protocol data acquisition system comprising:

the protocol identification module is used for identifying a data transmission protocol corresponding to the data transmission channel in a mode of calculating a characteristic value;

the binding module is used for binding the data transmission channel with a corresponding data transmission protocol;

and the data analysis module is used for analyzing the acquired data according to the message format of the protocol.

Preferably, the protocol identification module includes:

the protocol message acquisition module is used for acquiring any two sections of protocol messages in the data transmission channel;

the characteristic value calculating module is used for calculating the characteristic values corresponding to the two sections of protocol messages;

and the comparison module is used for comparing the characteristic value with the characteristic value of the known protocol and identifying the data transmission protocol corresponding to the data transmission channel.

By adopting the technical scheme, the protocol type of a TCP stream can be identified rapidly and accurately. Compared with other algorithms, the protocol identification method can rapidly and accurately identify any protocol while remaining usable, and can run automatically in real time based on the developed protocol identification model.

Preferably, the eigenvalue calculation module takes any two segments of collected protocol messages in the data transmission channel as input, and processes the two segments of collected protocol messages by using an activation function to obtain the eigenvalues corresponding to the two segments of collected protocol messages.

Preferably, the protocol identification module further includes:

the weight and bias updating module is used for updating the weight and the bias when the calculated characteristic value matches none of the characteristic values of the known protocols, obtaining an updated protocol identification model, and recalculating the corresponding characteristic value.

Preferably, the weight and bias updating module updates the weight and bias in the following manner:

$$w_{t+1} = w_t - \eta \frac{\partial C}{\partial w}$$

wherein $w_{t+1}$ represents the updated weight, $w_t$ represents the weight before the update, $\eta$ is a constant, $C$ represents the loss function, and $\frac{\partial C}{\partial w}$ represents the partial derivative of the loss function with respect to the weight;

$$b_{t+1} = b_t - \eta \frac{\partial C}{\partial b}$$

wherein $b_{t+1}$ represents the updated bias, $b_t$ represents the bias before the update, $\eta$ is a constant, and $\frac{\partial C}{\partial b}$ represents the partial derivative of the loss function with respect to the bias; $n$ is the number of times the loss function is recursively calculated (so as to minimize the loss function until the desired expected value is reached); $e_i$ represents the deviation between the characteristic value calculated with input $x$, weight $w_i$ and bias $b_i$ and the actual characteristic value.

Preferably, the method further comprises the following steps:

the timing scanning and processing module is used for periodically scanning messages that failed to decode and re-identifying the data transmission protocol corresponding to the data transmission channel by using the updated protocol identification model. In this way, messages that failed to decode can be corrected periodically.

In a third aspect, the present application provides a multi-protocol data acquisition apparatus, which adopts the following technical scheme:

a multi-protocol data acquisition apparatus comprising a memory and a processor, the memory having stored thereon a computer program that can be loaded by the processor and executed to perform any of the methods as described above.

In a fourth aspect, the present application provides a computer-readable storage medium, which adopts the following technical solutions:

a computer-readable storage medium storing a computer program that can be loaded by a processor and executed to perform any of the methods described above.

In summary, the present application includes at least one of the following beneficial technical effects:

1. the data transmission protocol corresponding to the data transmission channel is identified by calculating the characteristic value, so that the time for reverse analysis of the protocol can be effectively shortened, and the data acquisition speed is further improved;

2. according to the method and the device, the weight coefficient and the offset value can be automatically calculated according to the recognition effect in the protocol recognition process, and are updated and applied to a new round of recognition, so that the accuracy of binary protocol recognition can be greatly improved. In addition, the method adopts an artificial neural network algorithm to identify and correct the binary protocol, automatically selects and extracts the protocol characteristic value by utilizing the similarity of network flow data and an image, directly uses the network flow data as the input of the neural network to supervise and learn, trains a network flow protocol identification model, and continuously updates and iterates the model by the method, so that the obtained protocol identification model can be optimized and operated in real time.

Drawings

FIG. 1 is a method flow diagram of one embodiment of the present application.

Fig. 2 is a flowchart of a method for identifying a data transmission protocol corresponding to a data transmission channel by calculating a characteristic value in an embodiment of the present application.

Detailed Description

The present application is described in further detail below with reference to figures 1-2.

The embodiment of the application discloses a multi-protocol data acquisition method. Referring to fig. 1, a multi-protocol data acquisition method includes:

S1, identifying a data transmission protocol corresponding to a data transmission channel by calculating a characteristic value;

S2, binding the data transmission channel with a corresponding data transmission protocol;

and S3, analyzing the acquired data according to the message format of the protocol.
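Steps S1 to S3 can be sketched as a small collector that identifies a channel's protocol once, remembers the binding, and then dispatches every message to the bound parser. This is a minimal sketch under stated assumptions: the class name `Collector`, the `identify` callback and the parser table are hypothetical, and the toy identifier used in the usage lines is a stand-in for the characteristic-value model described in the text.

```python
from typing import Callable, Dict, List

Parser = Callable[[bytes], dict]


class Collector:
    """Identify a channel's protocol (S1), bind it (S2), then parse messages (S3)."""

    def __init__(self, identify: Callable[[List[bytes]], str],
                 parsers: Dict[str, Parser]) -> None:
        self.identify = identify              # hypothetical identification callback
        self.parsers = parsers                # protocol name -> message parser
        self.bindings: Dict[str, str] = {}    # channel id -> protocol name

    def bind(self, channel_id: str, samples: List[bytes]) -> str:
        protocol = self.identify(samples)     # S1: identify from sampled messages
        self.bindings[channel_id] = protocol  # S2: bind channel to protocol
        return protocol

    def parse(self, channel_id: str, message: bytes) -> dict:
        protocol = self.bindings[channel_id]  # S3: parse with the bound protocol
        return self.parsers[protocol](message)


# Toy stand-ins, for illustration only.
def toy_identify(samples: List[bytes]) -> str:
    return "645" if all(s[:1] == b"\x68" for s in samples) else "unknown"


collector = Collector(toy_identify, {"645": lambda m: {"length": len(m)}})
bound = collector.bind("ch1", [b"\x68\x01\x16", b"\x68\x02\x16"])
parsed = collector.parse("ch1", b"\x68\x01\x16")
```

Because the binding is stored per channel, identification runs once per channel rather than once per message, which is where the claimed speed-up would come from.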

Optionally, as shown in fig. 2, step S1 includes:

S11, collecting any two sections of protocol messages in a data transmission channel (namely intercepting any two sections of binary data streams in the transmission channel);

S12, calculating characteristic values corresponding to the two sections of protocol messages;

and S13, comparing the characteristic value with the characteristic value of the known protocol, and identifying the data transmission protocol corresponding to the data transmission channel.

The characteristic value of the known protocol is obtained by calculation through an activation function according to the known protocol.
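Steps S11 to S13 can be sketched as below. The text does not say how a binary message section is turned into a numeric input, nor how close two characteristic values must be to count as a match, so both `segment_to_scalar` (mean byte value scaled to [0, 1]) and the tolerance `tol` are assumptions made purely for illustration; the weights and reference messages are likewise hypothetical.

```python
import math
from typing import Dict, Optional


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def segment_to_scalar(segment: bytes) -> float:
    """Assumed encoding: mean byte value scaled to [0, 1]."""
    return sum(segment) / (255.0 * len(segment)) if segment else 0.0


def characteristic_value(seg1: bytes, seg2: bytes,
                         w1: float, w2: float, b: float) -> float:
    """S12: feed two message sections through a single sigmoid neuron."""
    x1, x2 = segment_to_scalar(seg1), segment_to_scalar(seg2)
    return sigmoid(x1 * w1 + x2 * w2 + b)


def identify(value: float, known: Dict[str, float],
             tol: float = 1e-3) -> Optional[str]:
    """S13: return the known protocol whose characteristic value matches, if any."""
    for name, reference in known.items():
        if abs(value - reference) <= tol:
            return name
    return None


# Hypothetical weights and reference messages.
w1, w2, b = 1.0, 2.0, 0.0
known = {"645": characteristic_value(b"\x68\x11\x16", b"\x68\x22\x16", w1, w2, b)}

same = identify(characteristic_value(b"\x68\x11\x16", b"\x68\x22\x16", w1, w2, b), known)
other = identify(characteristic_value(b"\xff\xff", b"\xff\xff", w1, w2, b), known)
```

When `identify` returns `None`, the text calls for updating the weight and bias and recalculating, rather than giving up.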

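Optionally, step S12 includes:

taking any two sections of protocol messages collected in the data transmission channel as input and processing them with an activation function to obtain the characteristic values corresponding to the two sections of protocol messages.

The weight and the bias of the activation function can be set to any value during initialization and are then updated and optimized step by step during application.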

Optionally, as shown in fig. 2, in step S13, if no matching characteristic value is found when the calculated characteristic value is compared with the characteristic values of the known protocols, the weight and the bias are updated to obtain an updated protocol identification model, and the corresponding characteristic value is recalculated.

When a new data channel protocol is identified subsequently, the protocol identification model after weight and bias updating is adopted, and by analogy, the protocol identification model is continuously updated, so that the accuracy of protocol identification can be improved.

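Optionally, the weights may be updated in the following manner:

$$w_{t+1} = w_t - \eta \frac{\partial C}{\partial w}$$

wherein $\frac{\partial C}{\partial w}$ represents the partial derivative of the loss function with respect to the weight. The initial value of the weight may be set to 0.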

Optionally, the bias is updated in the following manner:

$$b_{t+1} = b_t - \eta \frac{\partial C}{\partial b}$$

In the formula, $b_{t+1}$ represents the updated bias, $b_t$ represents the bias before the update, $\eta$ is a constant, and $\frac{\partial C}{\partial b}$ represents the partial derivative of the loss function with respect to the bias; $n$ is the number of times the loss function is recursively calculated (so as to minimize the loss function until the desired expected value is reached); $e_i$ represents the deviation between the characteristic value calculated with input $x$, weight $w_i$ and bias $b_i$ and the actual characteristic value. The initial value of the bias may be set to 1.

Optionally, the value range of η may be greater than or equal to 0 and less than or equal to 1.

Optionally, the activation function may be a sigmoid function; other functions such as Tanh, ReLU, Leaky ReLU, PReLU, ELU or Maxout may also be used.

In a specific implementation, messages that failed to decode can be scanned periodically, and the data transmission protocol corresponding to the data transmission channel is re-identified by using the updated protocol identification model.
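One pass of such a periodic rescan might look like the sketch below. The function name `rescan_failed`, the shape of the failure queue and the `identify` callback are all hypothetical; how the pass is scheduled (a timer thread, a cron-style job, and so on) is left to the implementation.

```python
from typing import Callable, Dict, List, Optional, Tuple

FailedItem = Tuple[str, bytes]  # (channel id, message that failed to decode)


def rescan_failed(failed: List[FailedItem],
                  identify: Callable[[bytes], Optional[str]],
                  bindings: Dict[str, str]) -> List[FailedItem]:
    """Re-identify channels with failed messages; return the items still unresolved."""
    still_failed: List[FailedItem] = []
    for channel_id, message in failed:
        protocol = identify(message)          # uses the updated identification model
        if protocol is None:
            still_failed.append((channel_id, message))
        else:
            bindings[channel_id] = protocol   # re-mark the channel's protocol
    return still_failed


# Toy identifier standing in for the updated protocol identification model.
def toy_identify(message: bytes) -> Optional[str]:
    return "645" if message[:1] == b"\x68" and message[-1:] == b"\x16" else None


bindings = {"ch1": "unknown"}
remaining = rescan_failed([("ch1", b"\x68\x00\x16"), ("ch2", b"\x00")],
                          toy_identify, bindings)
```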

The embodiment of the application also discloses a multi-protocol data acquisition system. The system comprises the protocol identification module, the binding module and the data analysis module described in the second aspect.

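Optionally, the protocol identification module includes the protocol message acquisition module, the characteristic value calculating module and the comparison module described in the second aspect.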

Optionally, the eigenvalue calculation module takes any two segments of collected protocol messages in the data transmission channel as input, and processes the two segments of collected protocol messages by using an activation function to obtain eigenvalues corresponding to the two segments of collected protocol messages.

Optionally, the protocol identification module further includes the weight and bias updating module described in the second aspect.

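Optionally, the weight and bias updating module updates the weight in the following manner:

$$w_{t+1} = w_t - \eta \frac{\partial C}{\partial w}$$

wherein $\frac{\partial C}{\partial w}$ represents the partial derivative of the loss function with respect to the weight.

Optionally, the bias is updated using the following formula:

$$b_{t+1} = b_t - \eta \frac{\partial C}{\partial b}$$

In the formula, $b_{t+1}$ represents the updated bias, $b_t$ represents the bias before the update, $\eta$ is a constant, and $\frac{\partial C}{\partial b}$ represents the partial derivative of the loss function with respect to the bias.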

Optionally, the value range of η is greater than or equal to 0 and less than or equal to 1.

Optionally, the system further includes:

the timing scanning and processing module, which is used for periodically scanning messages that failed to decode and re-identifying the data transmission protocol corresponding to the data transmission channel by using the updated protocol identification model.

The embodiment of the application also discloses a multi-protocol data acquisition device. A multi-protocol data acquisition apparatus comprising a memory and a processor, the memory having stored thereon a computer program that can be loaded by the processor and that executes the method of any one of claims 1 to 9.

The embodiment of the application also discloses a computer readable storage medium. A computer-readable storage medium storing a computer program which can be loaded by a processor and which performs the method of any one of claims 1 to 9.

The computer-readable storage medium includes, for example, various media capable of storing program code, such as a USB disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Experimental example:

A protocol identification model trained in advance is used to identify the TCP flow of a data transmission channel and to determine the binary protocol corresponding to the TCP channel, and the TCP channel is then bound to a protocol identification code (namely, the characteristic value corresponding to the protocol). Finally, the corresponding protocol is called to process the data stream of the TCP channel, the corresponding messages are parsed, and the decoding success rate is recorded. In addition, messages that failed to decode are scanned periodically, the protocol is re-identified using the updated protocol identification model, and the protocol identifier of the TCP channel is re-marked.

The training process of the protocol recognition model is as follows:

1. taking any two sections of protocol messages collected in the data transmission channel as input, initializing the bias and the corresponding weights (both may be initialized to arbitrary values), and then multiplying the two inputs (protocol messages x1 and x2) by the weights (w1 and w2):

x1→x1×w1

x2→x2×w2

2. adding the two results together with a bias b yields:

(x1×w1)+(x2×w2)+b

3. processing the result through an activation function to obtain the output (namely, the characteristic value corresponding to the protocol message):

y = f(x1×w1 + x2×w2 + b).

The role of the activation function is to convert an unbounded input into an output of predictable form. One commonly used activation function is the sigmoid function.

The output of the sigmoid function is between 0 and 1, which is understood to compress numbers in the range (- ∞, + ∞) to within (0,1). The larger the positive value, the closer the output is to 1, and the larger the negative value, the closer the output is to 0.
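For reference, the standard definition of the sigmoid function and its derivative is given below; the symbol $z$ for the pre-activation value $x_1 w_1 + x_2 w_2 + b$ is introduced here for convenience and does not appear in the original text.

$$f(z) = \frac{1}{1 + e^{-z}}, \qquad f'(z) = f(z)\,\bigl(1 - f(z)\bigr)$$

The closed-form derivative is what makes the function convenient to differentiate when computing the partial derivatives of the loss function.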

For example, weights and biases in neurons are taken to be the following values:

w = [0,1] (vector notation for w1 = 0, w2 = 1), b = 4.

Giving an input x = [2,3] to the neuron, calculating the output of the neuron (i.e. obtaining the characteristic value corresponding to the protocol message) in a vector dot product form:

w·x + b = (x1×w1) + (x2×w2) + b = 2×0 + 3×1 + 4 = 7;

y = f(w·x + b) = f(7) = 0.999.
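The worked example above can be reproduced with a few lines of Python. The function names are illustrative only; the values of w, b and x are those given in the text.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(x, w, b):
    """Weighted sum of the inputs plus the bias, passed through the sigmoid."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return sigmoid(z)


y = neuron_output(x=[2, 3], w=[0, 1], b=4)
print(round(y, 3))  # 0.999
```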

4. comparing the characteristic value calculated for the protocol message with the characteristic values of the known protocols; if it is not equal to any of the known characteristic values, calculating the partial derivatives of the loss function with respect to all weights and biases, and updating each weight and bias using an update formula, wherein the weights are updated by:

$$w_{t+1} = w_t - \eta \frac{\partial C}{\partial w}$$

wherein $w_{t+1}$ represents the updated weight, $w_t$ represents the weight before the update, $\eta$ is a constant called the learning rate, which determines the speed at which the network is trained, $C$ represents the loss function, and $\frac{\partial C}{\partial w}$ represents the partial derivative of the loss function with respect to the weight. When $\frac{\partial C}{\partial w}$ is positive, $w_{t+1}$ becomes smaller; when $\frac{\partial C}{\partial w}$ is negative, $w_{t+1}$ becomes larger.

The bias is updated by:

$$b_{t+1} = b_t - \eta \frac{\partial C}{\partial b}$$

By gradually changing the weight w and the bias b of the network by using the method, the loss function is slowly reduced, so that the neural network is improved, and the accuracy of protocol identification is greatly increased.
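The update rule can be illustrated with a single sigmoid neuron trained by gradient descent. The exact loss function is not recoverable from this text, so the sketch assumes a mean squared error, C = (1/n) Σ e², with e = f(w·x + b) − y′; the training pairs and the learning rate η = 0.1 are hypothetical, while the starting values w = [0, 1] and b = 4 are taken from the example above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_gradients(samples, w, b):
    """Return the assumed loss C = (1/n) * sum(e^2) and its partial derivatives."""
    n = len(samples)
    loss, grad_w, grad_b = 0.0, [0.0] * len(w), 0.0
    for x, target in samples:
        y = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        e = y - target                          # e = f(w.x + b) - y'
        loss += e * e / n
        common = 2.0 * e * y * (1.0 - y) / n    # chain rule through the sigmoid
        grad_w = [g + common * xj for g, xj in zip(grad_w, x)]
        grad_b += common
    return loss, grad_w, grad_b


def update(samples, w, b, eta):
    """One step of w <- w - eta * dC/dw and b <- b - eta * dC/db."""
    _, grad_w, grad_b = loss_and_gradients(samples, w, b)
    return [wj - eta * gj for wj, gj in zip(w, grad_w)], b - eta * grad_b


# Hypothetical training pairs: (input vector, target characteristic value).
samples = [([2.0, 3.0], 0.0), ([-1.0, -2.0], 1.0)]
w, b = [0.0, 1.0], 4.0
loss_before, _, _ = loss_and_gradients(samples, w, b)
for _ in range(500):
    w, b = update(samples, w, b, eta=0.1)
loss_after, _, _ = loss_and_gradients(samples, w, b)
```

With a small learning rate each step moves the parameters against the gradient, so `loss_after` ends up below `loss_before`; a different loss function would change the gradient expressions but not the form of the update.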

The loss function is used to estimate the degree of inconsistency between the value predicted by the model and the true value. It is a non-negative real-valued function, and the smaller the loss function is, the better the robustness of the model. The loss function is a core part of the empirical risk function and is also an important component of the structural risk function.
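As an illustration of how the partial derivatives could be obtained, assume a squared-error term per recursion with the error $e_i = f(w_i \cdot x + b_i) - y'$ given in the claims, and a sigmoid activation $f$. The squared-error form is an assumption, since the defining formula of $C$ is not reproduced in this text. Writing $z_i = w_i \cdot x + b_i$, the chain rule gives:

$$\frac{\partial (e_i^2)}{\partial w_i} = 2\, e_i\, f(z_i)\bigl(1 - f(z_i)\bigr)\, x, \qquad \frac{\partial (e_i^2)}{\partial b_i} = 2\, e_i\, f(z_i)\bigl(1 - f(z_i)\bigr)$$

Both derivatives vanish when $e_i = 0$, so under this assumption the weight and bias stop changing once the calculated characteristic value equals the expected one.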

After the protocol identification model has been trained and updated by this method, the updated model is used to identify the TCP flow of the data transmission channel and to determine the binary protocol corresponding to the TCP channel, and the TCP channel is then bound to a protocol identification code (namely, the characteristic value corresponding to the protocol). Finally, the corresponding protocol is called to process the data stream of the TCP channel, the corresponding messages are parsed, and the decoding success rate is recorded. Messages that failed to decode are then scanned periodically, the protocol is re-identified using the updated protocol identification model, and the protocol identifier of the TCP channel is re-marked. The process repeats in this way.

The above embodiments are preferred embodiments of the present application, and the protection scope of the present application is not limited by the above embodiments, so: all equivalent changes made according to the structure, shape and principle of the present application shall be covered by the protection scope of the present application.

Claims

1. A multi-protocol data acquisition method, comprising:

analyzing the collected multi-protocol data according to the message format of the data transmission protocol;

wherein identifying the data transmission protocol corresponding to the data transmission channel by calculating a characteristic value comprises the following steps:

calculating the characteristic values corresponding to the two sections of protocol messages; the method specifically comprises the following steps: any two sections of protocol messages in the collected data transmission channel are used as input and are processed by utilizing an activation function, and characteristic values corresponding to the two sections of protocol messages are obtained;

comparing the characteristic value with the characteristic value of the known data transmission protocol, and identifying the data transmission protocol corresponding to the data transmission channel; if the eigenvalue is not matched with the eigenvalue of the known data transmission protocol during the comparison between the eigenvalue and the eigenvalue, updating the weight and the bias to obtain an updated data transmission protocol identification model, and recalculating the corresponding eigenvalue;

specifically, the weights are updated in the following manner:

$$w_{t+1} = w_t - \eta \frac{\partial C}{\partial w}$$

wherein $\frac{\partial C}{\partial w}$ represents the partial derivative of the loss function with respect to the weight;

the bias is updated in the following way:

$$b_{t+1} = b_t - \eta \frac{\partial C}{\partial b}$$

wherein $e_i = f(w_i \cdot x + b_i) - y'$; $b_{t+1}$ represents the updated bias, $b_t$ represents the bias before the update, $\eta$ is a constant, and $\frac{\partial C}{\partial b}$ represents the partial derivative of the loss function with respect to the bias; $n$ is the number of times the loss function is recursively calculated; $e_i$ represents the deviation between the characteristic value calculated with input $x$, weight $w_i$ and bias $b_i$ and the actual characteristic value, and $f(\cdot)$ represents the activation function.

2. The multi-protocol data acquisition method according to claim 1, wherein η is in a range of 0 to 1.

3. A multi-protocol data acquisition system, comprising:

the data analysis module is used for analyzing the collected multi-protocol data according to the message format of the data transmission protocol;

wherein, the protocol identification module comprises:

the protocol message acquisition submodule is used for acquiring any two sections of protocol messages in the data transmission channel;

the characteristic value calculation submodule is used for calculating the characteristic values corresponding to the two sections of protocol messages, specifically: any two sections of protocol messages collected in the data transmission channel are used as input and are processed by an activation function to obtain the characteristic values corresponding to the two sections of protocol messages;

the comparison and transmission protocol identification submodule is used for comparing the characteristic value with the characteristic value of the known data transmission protocol and identifying the data transmission protocol corresponding to the data transmission channel;

the weight and bias updating submodule is used for updating the weight and the bias when the calculated characteristic value matches none of the characteristic values of the known data transmission protocols, obtaining an updated protocol identification model, and recalculating the corresponding characteristic value;

specifically, in the weight and bias updating submodule, the weight is updated in the following manner:

$$w_{t+1} = w_t - \eta \frac{\partial C}{\partial w}$$

wherein $\frac{\partial C}{\partial w}$ represents the partial derivative of the loss function with respect to the weight;

in the weight and bias updating submodule, the bias is updated in the following way:

$$b_{t+1} = b_t - \eta \frac{\partial C}{\partial b}$$

wherein $e_i = f(w_i \cdot x + b_i) - y'$; $b_{t+1}$ represents the updated bias, $b_t$ represents the bias before the update, $\eta$ is a constant, and $\frac{\partial C}{\partial b}$ represents the partial derivative of the loss function with respect to the bias; $n$ is the number of times the loss function is recursively calculated; $e_i$ represents the deviation between the characteristic value calculated with input $x$, weight $w_i$ and bias $b_i$ and the actual characteristic value, and $f(\cdot)$ represents the activation function.

4. A multi-protocol data acquisition device comprising a memory and a processor, the memory having stored thereon a computer program that can be loaded by the processor and that executes the method according to claim 1 or 2.

5. A computer-readable storage medium, in which a computer program is stored which can be loaded by a processor and which executes the method according to claim 1 or 2.