CN115037698B - Data identification method and device and electronic equipment - Google Patents


Info

Publication number
CN115037698B
Authority
CN
China
Prior art keywords
data
data stream
preset
session
command
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210597597.7A
Other languages
Chinese (zh)
Other versions
CN115037698A (en
Inventor
罗耀祖
金少辉
王娟
吕玉超
周玉波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Cloud Technology Co Ltd
Original Assignee
Tianyi Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianyi Cloud Technology Co Ltd filed Critical Tianyi Cloud Technology Co Ltd
Priority to CN202210597597.7A priority Critical patent/CN115037698B/en
Publication of CN115037698A publication Critical patent/CN115037698A/en
Priority to PCT/CN2022/141581 priority patent/WO2023231391A1/en
Application granted granted Critical
Publication of CN115037698B publication Critical patent/CN115037698B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2483 Traffic characterised by specific attributes, e.g. priority or QoS involving identification of individual flows

Abstract

A data identification method, a data identification device and an electronic device. The method comprises: obtaining M of the N data packets in a session data stream; matching the data in each of the M data packets against each preset feature in a preset feature set one by one, and obtaining an accumulated weight value corresponding to the session data stream according to the matching result; and, in response to the accumulated weight value being greater than a preset threshold, determining that the session data stream is a data stream corresponding to e-mail. With this method, the data in at most M data packets of the session data stream is identified using the preset features in the preset feature set, each of which is associated with a weight value. This avoids the server congestion caused by caching a set of session data streams on the server and inspecting the data in every data packet of each session data stream, and thereby improves the efficiency of detecting session data streams.

Description

Data identification method and device and electronic equipment
Technical Field
The present disclosure relates to the field of computer network security technologies, and in particular, to a data identification method, apparatus, and electronic device.
Background
With the development of computer network technology, e-mail has become increasingly common in daily work, and it has also become a vector for network attacks. To ensure a secure network environment, the e-mail data exchanged between a client and a server therefore needs to be analyzed.
Because transmitting e-mail between the client and the server requires a Transmission Control Protocol (TCP) connection, the server, in order to detect e-mail more efficiently, collects a set of session data streams from the network data based on the TCP port number. Each session data stream in the set consists of the data packets generated when the client and the server transmit e-mail. The server inspects each session data stream in the set against a preset keyword set, and determines that a session data stream is a data stream corresponding to e-mail when at least one keyword in the preset keyword set is present in it.
However, when session data streams are detected in this way, every data packet in each session data stream has to be inspected and the server has to cache a large set of session data streams. The server becomes congested as a result, which reduces the efficiency with which it detects session data streams.
Disclosure of Invention
The application provides a data identification method, a data identification device and an electronic device, which are used to improve the efficiency of detecting session data streams and the accuracy with which a server identifies a session data stream as an e-mail data stream.
In a first aspect, the present application provides a data identification method, the method comprising:
obtaining M of the N data packets in a session data stream, wherein N and M are positive integers and M is less than or equal to N;
matching the data in each of the M data packets against each preset feature in a preset feature set one by one, and obtaining an accumulated weight value corresponding to the session data stream according to the matching result, wherein the accumulated weight value is obtained by adding together the weight values associated with all preset features matched by the data in the M data packets;
and, in response to the accumulated weight value being greater than a preset threshold, determining that the session data stream is a data stream corresponding to e-mail.
With this method, each session data stream is detected individually and only the data in its first M data packets is inspected. This avoids the congestion and reduced detection efficiency that arise when the server caches a large set of session data streams, and thereby improves the efficiency of detecting session data streams.
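The three steps of the first aspect can be illustrated with a minimal Python sketch. It assumes that each preset feature is a plain byte string searched for in the packet payload; the feature weights, the threshold, the sample session and the function names are hypothetical and are not taken from the application.

```python
from typing import Iterable, Mapping


def accumulated_weight(payloads: Iterable[bytes],
                       features: Mapping[bytes, float],
                       m: int) -> float:
    """Add up the weights of every feature found in the first m payloads."""
    total = 0.0
    for index, payload in enumerate(payloads):
        if index >= m:  # only the first M packets are inspected
            break
        for pattern, weight in features.items():
            if pattern in payload:
                total += weight
    return total


def is_email_stream(payloads: Iterable[bytes],
                    features: Mapping[bytes, float],
                    m: int,
                    threshold: float) -> bool:
    """Decide by comparing the accumulated weight with the threshold."""
    return accumulated_weight(payloads, features, m) > threshold


# Hypothetical features, weights and session, for illustration only.
FEATURES = {b"MAIL FROM": 0.5, b"RCPT TO": 0.5, b"250 ": 0.25}
SESSION = [
    b"220 mx.example.com ready\r\n",
    b"HELO client.example.com\r\n",
    b"250 mx.example.com\r\n",
    b"MAIL FROM:<alice@example.com>\r\n",
]
print(is_email_stream(SESSION, FEATURES, m=4, threshold=0.6))
```

With these made-up values the first four packets accumulate a weight of 0.75, which exceeds the assumed threshold of 0.6, whereas inspecting only the first two packets accumulates nothing.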
In one possible design, before matching the data in each of the M data packets with each preset feature in the preset feature set one by one, the method includes:
acquiring a command keyword set and a command response keyword set from a preset protocol, wherein the command keyword set is uplink data sent to a server by a client, and the command response keyword set is downlink data sent to the client by the server;
associating each command keyword in the command keyword set with a command keyword identifier, generating an uplink feature word string set corresponding to the command keyword set, associating each command response keyword in the command response keyword set with a command response identifier, and generating a downlink feature word string set corresponding to the command response keyword set, wherein the identifier comprises a position range of the command keyword or the command response keyword and a character code;
inputting the uplink characteristic word string set and the downlink characteristic word string set into a preset characteristic compiling model to obtain a preset characteristic set.
In one possible design, the matching the data in each of the M data packets with each preset feature in the preset feature set one by one, and obtaining the accumulated weight value corresponding to the session data stream according to the matching result includes:
extracting, one by one, the source port number and/or destination port number of each data packet in the session data stream, and determining from the preset feature set, based on the source port number and/or the destination port number, the target preset feature set corresponding to each data packet;
matching the data in each of the M data packets against the preset features in the target preset feature set one by one to obtain a weight value for each detected data packet, wherein the weight value represents the probability that the data in the data packet belongs to a data stream corresponding to e-mail;
successively accumulating the weight values of the detected data packets to obtain the accumulated weight value corresponding to the session data stream.
In one possible design, the method further comprises:
detecting whether the total number of the detected data packets in the session data stream exceeds a preset value or not in response to the accumulated weight value being smaller than a preset threshold value;
if yes, marking the session data stream, and ignoring the undetected residual data packets in the marked session data stream;
if not, continuing to detect the data in the undetected data packet in the session data stream.
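The stop condition in this design can be sketched as a small per-session state object in Python. The class and field names are hypothetical. The sketch also assumes that the session is marked once the number of inspected packets reaches the preset value, following the wording of the device embodiment, and that the weight of each packet has already been computed elsewhere.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    EMAIL = "email"          # accumulated weight exceeded the threshold
    IGNORED = "ignored"      # session marked; remaining packets are skipped
    UNDECIDED = "undecided"  # keep inspecting further packets


@dataclass
class SessionState:
    threshold: float     # preset threshold for the accumulated weight
    max_packets: int     # preset value for the number of inspected packets
    total_weight: float = 0.0
    inspected: int = 0
    verdict: Verdict = Verdict.UNDECIDED

    def feed(self, packet_weight: float) -> Verdict:
        """Account for one inspected packet and return the current verdict."""
        if self.verdict is not Verdict.UNDECIDED:
            return self.verdict  # decided sessions ignore further packets
        self.inspected += 1
        self.total_weight += packet_weight
        if self.total_weight > self.threshold:
            self.verdict = Verdict.EMAIL
        elif self.inspected >= self.max_packets:
            self.verdict = Verdict.IGNORED
        return self.verdict
```

Keeping the verdict in the state object means that packets arriving after a decision cost only one comparison, which is the point of marking the session.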
In a second aspect, the present application provides a data identification device, the device comprising:
an acquisition module, configured to obtain M of the N data packets in a session data stream;
a matching module, configured to match the data in each of the M data packets against each preset feature in a preset feature set one by one, and to obtain an accumulated weight value corresponding to the session data stream according to the matching result;
and a response module, configured to determine, in response to the accumulated weight value being greater than a preset threshold, that the session data stream is a data stream corresponding to e-mail.
In one possible design, the acquisition module is specifically configured to obtain a command keyword set and a command response keyword set from a preset protocol, associate each command keyword in the command keyword set with a command keyword identifier, generate an uplink feature word string set corresponding to the command keyword set, associate each command response keyword in the command response keyword set with a command response identifier, generate a downlink feature word string set corresponding to the command response keyword set, and input the uplink feature word string set and the downlink feature word string set into a preset feature compiling model to obtain a preset feature set.
In one possible design, the matching module is specifically configured to extract a source port number and/or a destination port number corresponding to each data packet in the session data flow one by one, determine, from a preset feature set, a target preset feature set corresponding to each data packet in the session data flow one by one based on the source port number and/or the destination port number, match data in each data packet in the M data packets one by one with preset features in the target preset feature set, obtain weight values of each detected data packet, gradually accumulate the weight values of each detected data packet, and obtain accumulated weight values corresponding to the session data flow.
In one possible design, the response module is specifically configured to, in response to the accumulated weight value being smaller than the preset threshold: when the total number of detected data packets in the session data stream reaches a preset value, mark the session data stream and ignore the remaining undetected data packets in the marked session data stream; and when the total number of detected data packets in the session data stream is smaller than the preset value, continue to detect the data in the undetected data packets in the session data stream.
In a third aspect, the present application provides an electronic device, including:
a memory for storing a computer program;
and a processor, configured to implement the steps of the data identification method described above when executing the computer program stored in the memory.
In a fourth aspect, the present application provides a computer-readable storage medium in which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the data identification method described above.
For the technical effects achievable by each of the above aspects and their possible designs, reference is made to the description above of the technical effects achievable by the first aspect and its possible designs; they are not repeated here.
Drawings
Fig. 1 is a schematic diagram of email transmission between a client and a server provided in the present application;
Fig. 2 is a flowchart of the steps of a data identification method provided in the present application;
Fig. 3 is a schematic structural diagram of a data identification device provided in the present application;
Fig. 4 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The specific operations in the method embodiment may also be applied to the device embodiment or the system embodiment. It should be noted that, in the description of the present application, "a plurality of" means "at least two". "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" may mean: A exists alone, A and B both exist, or B exists alone. "A is connected with B" covers two cases: A and B are connected directly, and A and B are connected through C. In addition, in the description of the present application, the words "first", "second" and the like are used only to distinguish between descriptions and are not to be construed as indicating or implying relative importance or order.
In the prior art, e-mail transmission between a client and a server is as shown schematically in Fig. 1. When the client needs to transmit e-mail to the server, the client and the server exchange information, and this information passes through a server that performs the detection. To confirm whether a session data stream is a data stream corresponding to e-mail, that server collects a set of session data streams from the network data based on the TCP port number and, when detecting a session data stream in the set, inspects every data packet in it. Because the server caches a large set of session data streams, it becomes congested, which reduces the efficiency with which it detects session data streams.
To solve the above problems, an embodiment of the present application provides a data identification method that improves the efficiency with which a server detects session data streams. The method and the device described in the embodiments of the present application are based on the same technical concept. Because they solve the problem on similar principles, the device and method embodiments may be referred to each other, and repeated descriptions are omitted.
Embodiments of the present application are described in detail below with reference to the accompanying drawings.
Referring to fig. 2, the present application provides a data identification method, which can improve the efficiency of detecting a session data stream by a server, and the implementation flow of the method is as follows:
Step S21: obtain M of the N data packets in the session data stream.
A data stream corresponding to e-mail is transmitted as data packets, and the source IP address, destination IP address, source port number and destination port number of such a data stream are dynamic during transmission. When the server receives data packets sent by a client or a server, it receives at least one data packet at a time, and the data packets it receives originate from at least one session data stream; each session data stream delivers its data packets to the server one packet at a time. The server determines which session data stream each received data packet belongs to by identifying the packet's source IP address and/or destination IP address. Having done so, the server extracts the packet's source port number and/or destination port number, which identify the sending end of the data packet.
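As an illustration of how received packets might be assigned to session data streams, the following Python sketch groups packets by their two endpoints. The `Packet` type and the choice of a direction-independent key built from both IP addresses and both port numbers are assumptions made for the example; the text above only states that the source and/or destination IP address is used.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple

Endpoint = Tuple[str, int]


class Packet(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    payload: bytes


def flow_key(pkt: Packet) -> Tuple[Endpoint, Endpoint]:
    """Return a key that is identical for both directions of one session."""
    a = (pkt.src_ip, pkt.src_port)
    b = (pkt.dst_ip, pkt.dst_port)
    return (a, b) if a <= b else (b, a)


def group_by_session(packets: Iterable[Packet]) -> Dict[Tuple[Endpoint, Endpoint], List[Packet]]:
    """Collect packets per session while preserving arrival order."""
    sessions: Dict[Tuple[Endpoint, Endpoint], List[Packet]] = defaultdict(list)
    for pkt in packets:
        sessions[flow_key(pkt)].append(pkt)
    return dict(sessions)
```

Sorting the two endpoints makes a request and its reply land in the same list, so uplink and downlink packets of one session can be inspected in arrival order.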
To improve the efficiency with which the server detects session data streams, the server first obtains a session data stream from the network data based on the TCP port number. The session data stream may be a Simple Mail Transfer Protocol (SMTP) session data stream, in which case it contains at least 8 data packets. The session data stream may also be of another type, and the number of data packets in it may be adjusted according to the actual situation. The present application takes an SMTP session data stream as an example; for other types of session data stream, reference may be made to the embodiments of the present application, and they are not described further here.
The obtained session data stream contains data packets carrying invalid data, that is, data packets whose data portion is empty. Inspecting such invalid data packets would prolong the detection of the session data stream, so the server ignores any received data packet that is invalid. The server receives data packets one by one and inspects each as it arrives. The number of data packets the server may need to inspect is N, whereas the number it actually inspects is M, with N greater than or equal to M.
In one possible design, the server may instead obtain the complete session data stream and determine from it the N data packets that carry valid data. To further improve the detection efficiency, the server then selects M data packets from these N data packets.
By the method, the first M data packets are determined from N data packets in the session data stream, so that detection of each data packet in the session data stream is avoided, the number of the data packets to be detected is reduced, and the efficiency of detecting the session data stream is improved.
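Step S21 can be sketched in a few lines of Python: packets with an empty data portion are skipped, and only the first M remaining packets are kept. The function name is hypothetical, and payloads are assumed to be available as byte strings.

```python
from itertools import islice
from typing import Iterable, List


def first_m_valid(payloads: Iterable[bytes], m: int) -> List[bytes]:
    """Return the first m payloads whose data portion is not empty."""
    valid = (payload for payload in payloads if payload)  # drop empty packets
    return list(islice(valid, m))
```

Because the generator is consumed lazily, packets after the M-th valid one are never read, which matches the packet-by-packet reception described above.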
Step S22: match the data in each of the M data packets against each preset feature in a preset feature set one by one, and obtain an accumulated weight value corresponding to the session data stream according to the matching result.
After the session data stream is obtained, it needs to be detected. Detection consists of matching the data in each of the M data packets determined above against the preset features in the preset feature set one by one, so the preset feature set must first be obtained. It is obtained as follows:
E-mail is an exchange of information between the client and the server, and it is transmitted according to SMTP. SMTP specifies the rules for transmitting e-mail, and the client and the server must exchange information according to these rules, so whether a session data stream is a data stream corresponding to e-mail can be judged on the basis of these rules.
The sender of an e-mail message may be either the client or the server, and SMTP specifies separate sending rules for each. To judge whether a session data stream is a data stream corresponding to e-mail, the corresponding rules therefore need to be extracted from SMTP for each sending end. The specific extraction process is as follows:
In the embodiments of the present application, the rules in SMTP are used as keywords, and they are divided according to the sending end. Uplink data that SMTP specifies the client sends to the server is used as command keywords: a request sent by the client to the server contains a command keyword. The server replies to the request, and the downlink data with which the server responds to the client is used as command response keywords.
Further, because SMTP specifies various kinds of data sent by the client, it defines more than one command keyword; likewise, because it specifies various kinds of data sent by the server, it defines more than one command response keyword. In the embodiments of the present application, the command keywords are collectively referred to as the command keyword set, and the command response keywords as the command response keyword set.
The command keyword set obtained from SMTP email transfer protocol is shown in table 1:
Command keyword    Data corresponding to the command keyword
HELO               Host name of the client
MAIL FROM          Sender
RCPT TO            Intended recipient
DATA               Mail body
......             ......
TABLE 1
Table 1 above records command keywords contained in requests sent by the client to the server. Only 4 command keywords and their corresponding data are listed; together, the command keywords form the command keyword set. Other command keywords and their descriptions can be obtained from SMTP and take the same form as the examples in Table 1, so they are not described further here.
The command response keyword set obtained from SMTP email transfer protocol is shown in table 2:
Command response keyword    Description
220                         Service ready
221                         Service closing transmission channel
250                         Requested mail action okay, completed
......                      ......
TABLE 2
Table 2 above records command response keywords with which the server responds to the client, together with a description of each. Only 3 command response keywords are listed; they and their descriptions can be extracted from SMTP, and together the command response keywords form the command response keyword set. Because SMTP is publicly available, the other command response keywords are not listed one by one; their descriptions take the same form as the examples in Table 2.
The command keyword set and the command response keyword set described above are extracted from SMTP, so session data streams corresponding to SMTP e-mail can be identified on the basis of SMTP. When the e-mail is not SMTP e-mail, the command keyword set and the command response keyword set can be extracted from another e-mail protocol in the same way, for example Post Office Protocol Version 3 (POP3) or Internet Message Access Protocol (IMAP). Because the extraction process is the same as that described above, it is not described further here.
After the command keyword set and the command response keyword set are obtained, they need to be distinguished from each other. To this end, each command keyword in the command keyword set is associated with a command keyword identifier, yielding a command keyword set with associated command keyword identifiers, and each command response keyword in the command response keyword set is associated with a command response keyword identifier, yielding a command response keyword set with associated command response keyword identifiers. A command keyword identifier comprises the position range of the command keyword and its character code; a command response keyword identifier comprises the position range of the command response keyword and its character code.
The command keyword set with associated command keyword identifiers is then used as the uplink preset feature set, and the command response keyword set with associated command response keyword identifiers as the downlink preset feature set. The uplink preset feature set contains the uplink preset features found in requests sent by the client to the server, and the downlink preset feature set contains the downlink preset features found in responses sent by the server to the client. The uplink preset feature set and the downlink preset feature set are shown in Table 3:
TABLE 3
Table 3 above lists two command keywords from the command keyword set together with their associated command keyword identifiers. MAIL FROM is a command keyword in the command keyword set, and [0-9]MAIL|20|FROM is the uplink preset feature generated from it, in which [0-9] and |20| are command keyword identifiers: [0-9] indicates that the data in the position range 0-9 contains the keyword, and |20| is the character code of the space (the ASCII value of the space is 32, which is 0x20 in hexadecimal). [=]DATA|20| indicates that the data in the data packet is identical to the command keyword, with the same number of characters.
Table 3 also lists two command response keywords from the command response keyword set together with their associated command response keyword identifiers. 250 is a command response keyword in the command response keyword set. [0-11]250|20|MAIL|20|OK is the reply to MAIL FROM in a request sent by the client, and [0-11]250|20|RCPT|20|OK is the reply to RCPT TO in a request sent by the client. In both, [0-11] and |20| are command response keyword identifiers: [0-11] is the position range in which 250 is located, and |20| is the character code of the space.
The other command keywords in the command keyword set are stored in the same form as the two uplink preset features listed in Table 3, and the other command response keywords in the command response keyword set are stored in the same form as the two downlink preset features listed in Table 3. Because the uplink and downlink preset feature sets contain a large number of preset features, they are not all listed here.
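The feature word strings described above can be illustrated with a small Python parser. It is only a sketch under stated assumptions: the position range [a-b] is read as the byte offsets a up to (but not including) b of the payload, [=] is read as an exact match of the whole payload, and |hh| is read as a hexadecimal character code. The names `Feature`, `parse_feature` and `matches` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    start: Optional[int]  # None means exact match ("[=]")
    end: Optional[int]
    literal: bytes


_PREFIX = re.compile(r"^\[(?:(\d+)-(\d+)|=)\]")
_HEX_ESCAPE = re.compile(r"\|([0-9A-Fa-f]{2})\|")


def parse_feature(text: str) -> Feature:
    """Split a feature word string into its identifier and literal bytes."""
    prefix = _PREFIX.match(text)
    if prefix is None:
        raise ValueError(f"missing identifier prefix: {text!r}")
    body = text[prefix.end():]
    # Replace each |hh| character code with the character it encodes.
    decoded = _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), body)
    literal = decoded.encode("latin-1")
    if prefix.group(1) is None:
        return Feature(None, None, literal)
    return Feature(int(prefix.group(1)), int(prefix.group(2)), literal)


def matches(feature: Feature, payload: bytes) -> bool:
    """Check a payload against one parsed feature."""
    if feature.start is None:
        return payload == feature.literal
    return feature.literal in payload[feature.start:feature.end]
```

For instance, `parse_feature("[0-9]MAIL|20|FROM")` yields the literal `b"MAIL FROM"` restricted to the first nine bytes, so a payload that starts with `MAIL FROM:` matches while one that mentions the keyword later does not.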
After the uplink preset feature set and the downlink preset feature set are obtained, the uplink preset feature set and the downlink preset feature set are input into a preset feature compiling model to obtain the preset feature set, and preset features in the preset feature set are shown in table 4:
TABLE 4
Table 4 above records the preset features in the preset feature set, comprising the preset feature set in the uplink direction and the preset feature set in the downlink direction. After compilation by the Hyperscan rule compiler, [0-9]MAIL|20|FROM is compiled into 4d 41 49 4c 20 46 52 4f 4d, [=]DATA|20| is compiled into 44 41 54 41 20, [0-11]250|20|MAIL|20|OK is compiled into 32 35 30 20 4d 41 49 4c 20 4f 4b, and [0-11]250|20|RCPT|20|OK is compiled into 32 35 30 20 52 43 50 54 20 4f 4b. Table 4 shows only the compiled forms of these two uplink preset features and two downlink preset features; the other uplink and downlink preset features are compiled and stored in the same form and are not described further here.
It should be noted that, when a session data stream matches preset features in the preset feature set, different matched features indicate with different probabilities that the session data stream is a data stream corresponding to e-mail. For example, suppose the matched preset features are A, B and C. If the session data stream matches A, it can be determined to be a data stream corresponding to e-mail; if it matches B or C, this cannot yet be determined, and the judgment must be made in combination with other preset features. To further improve the accuracy of identifying session data streams, each preset feature in the preset feature set is therefore associated with a weight value. The weight value represents the probability, given a match on that preset feature, that the session data stream is a data stream corresponding to e-mail: the larger the weight value, the higher this probability, and the smaller the weight value, the lower this probability.
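The hexadecimal strings quoted above are the ASCII byte values of the corresponding keywords, which can be checked with a short Python sketch. It does not call Hyperscan; it only reproduces the byte values, and the helper name is hypothetical.

```python
def to_hex(literal: bytes) -> str:
    """Render a byte string as space-separated two-digit hex values."""
    return literal.hex(" ")


# Literal form of the four features discussed above.
COMPILED = {
    "MAIL FROM": to_hex(b"MAIL FROM"),
    "DATA ": to_hex(b"DATA "),
    "250 MAIL OK": to_hex(b"250 MAIL OK"),
    "250 RCPT OK": to_hex(b"250 RCPT OK"),
}
```

The hexadecimal values correspond to the upper-case keywords, so matching on these bytes as written is case-sensitive.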
The specific process of associating each preset feature in the preset feature set with a weight value is as follows:
Since the preset feature set is derived from the SMTP email transmission protocol, the uplink preset feature set and the downlink preset feature set extracted based on the SMTP email transmission protocol include: compared with other types of email protocols, the SMTP email transmission protocol has uplink preset features and/or downlink preset features; and the probability of recognizing that the conversation data stream is the SMTP email based on the uplink preset feature and/or the downlink preset feature is higher.
Further, the higher the repetition degree of the uplink preset feature and/or the downlink preset feature in the preset feature set and the uplink preset feature and/or the downlink preset feature in other types of emails is, the lower the probability that the session data stream is identified as the SMTP email is, because the preset feature set is compiled based on the SMTP email transmission protocol, when the Hyperscan rule compiler compiles the uplink preset feature set and the downlink preset feature set, the preset feature set corresponding to other types of emails can be obtained by adopting the method described above, and the weight value of each preset feature in the preset feature set obtained based on the SMTP email transmission protocol in the embodiment of the application is determined based on the repetition degree of the preset feature in the plurality of preset feature sets, and the preset feature set with the corresponding weight value is associated, and because the obtaining of the weight value is generated based on the Hyperscan rule compiler, the Hyperscan rule is not excessively set forth herein.
For example, the preset feature set with associated weight values is shown in Table 5 below:

Preset feature    Weight value corresponding to preset feature
z                 0.09
x                 0.11
c                 0.10
A                 0.10
B                 0.02
......            ......

TABLE 5
Table 5 above records five preset features and the weight value corresponding to each. The weight values associated with the uplink preset features in the uplink preset feature set, and those associated with the downlink preset features in the downlink preset feature set, are stored in the same form as the examples in Table 5, which is not described in detail here.
It should be noted that, when a preset feature set without associated weight values is obtained in the embodiment of the present application, the weight value of each preset feature in the preset feature set may be recorded as 1.
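The description says that weight values depend on how often a preset feature repeats across the feature sets of several email protocols, but it does not give a formula. The following is a minimal sketch under one assumed rule: a feature's weight is the reciprocal of the number of protocol feature sets that contain it. The function name `assign_weights`, the reciprocal rule, and the small keyword sets are illustrative assumptions, not the weighting used in the embodiment.

```python
from collections import Counter


def assign_weights(target_features, all_feature_sets):
    """Weight each target feature by 1 / (number of feature sets containing it).

    Assumed rule for illustration only: the more protocols share a feature,
    the less it says about the target protocol, so the lower its weight.
    """
    counts = Counter()
    for feature_set in all_feature_sets:
        counts.update(set(feature_set))
    return {f: 1.0 / max(counts[f], 1) for f in target_features}


# Hypothetical, abbreviated keyword sets for three mail protocols.
smtp = {"EHLO", "MAIL FROM", "QUIT", "NOOP"}
pop3 = {"USER", "RETR", "QUIT", "NOOP"}
imap = {"LOGIN", "FETCH", "NOOP"}

weights = assign_weights(smtp, [smtp, pop3, imap])

# Fallback mentioned in the text: without associated weights, record each as 1.
default_weights = dict.fromkeys(smtp, 1.0)
```

Under this assumed rule, a keyword found only in SMTP keeps the full weight, while one shared with other protocols is discounted.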
After the preset feature set with associated weight values is obtained in the above manner, in order to detect the data packets in the session data stream based on the preset feature set, the data in the at most M data packets described above needs to be matched one by one against the preset features in the preset feature set. The specific matching process is as follows:
Because email data can be sent by different ends (the client or the server), and in order to avoid matching the data packets in the session data stream against all the features in the preset feature set and to avoid detecting every data packet in the session data stream, a source port number and/or a destination port number needs to be extracted from the session data stream. A target preset feature set can then be determined from the preset feature set based on the source port number and/or the destination port number. When the source port number extracted from the session data stream is the port number of the client, or the extracted destination port number is the port number of the server, the target preset feature set is the preset feature set in the uplink direction. When the source port number extracted from the session data stream is the port number of the server, or the extracted destination port number is the port number of the client, the target preset feature set is the preset feature set in the downlink direction.
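The direction selection just described can be sketched as follows. This is an illustration under stated assumptions: the server is assumed to listen on TCP port 25 (the registered SMTP port), and the function name `select_target_feature_set` and the `None` result for unrelated ports are hypothetical choices, not details from the embodiment.

```python
# Assumption: the SMTP server listens on TCP port 25; extend the set as needed.
SERVER_PORTS = frozenset({25})


def select_target_feature_set(src_port, dst_port, uplink_set, downlink_set,
                              server_ports=SERVER_PORTS):
    """Pick the feature set that matches the packet's direction.

    Packets going to the server port are uplink (client -> server);
    packets coming from the server port are downlink (server -> client).
    """
    if dst_port in server_ports:
        return uplink_set
    if src_port in server_ports:
        return downlink_set
    return None  # neither endpoint is a known server port
```

For instance, a packet from port 50000 to port 25 would be checked only against the uplink set, so the downlink features are never evaluated for it.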
After the target preset feature set is obtained, the data in each of the M data packets is matched one by one against the preset features in the target preset feature set. After the data in a data packet has been matched, the weight value corresponding to that data packet is obtained. This process is repeated, and the weight values of the detected data packets are accumulated to obtain the accumulated weight value of the detected data packets.
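A rough sketch of the per-packet matching and accumulation is given below. It uses Python's standard `re` module as a stand-in for a compiled Hyperscan database, which is a deliberate substitution for illustration; the patterns, weights, and helper names (`compile_features`, `packet_weight`) are assumptions.

```python
import re


def compile_features(weighted_patterns):
    """Compile {pattern: weight} into a list of (regex, weight) pairs."""
    return [(re.compile(p, re.IGNORECASE), w)
            for p, w in weighted_patterns.items()]


def packet_weight(payload, compiled):
    """Sum the weights of all features that match this packet's payload."""
    return sum(w for rx, w in compiled if rx.search(payload))


# Hypothetical uplink features with made-up weights.
uplink = compile_features({rb"^EHLO ": 0.5, rb"^MAIL FROM:": 0.3})

packets = [
    b"EHLO client.example\r\n",
    b"MAIL FROM:<a@example.com>\r\n",
    b"hello world\r\n",
]

# Accumulate the per-packet weights over the detected packets.
accumulated = 0.0
for payload in packets:
    accumulated += packet_weight(payload, uplink)
```

A real deployment would scan with the compiled Hyperscan rules instead; only the accumulation logic is meant to carry over.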
By the above method, a preset feature set in which each preset feature is associated with a weight value is obtained, comprising a preset feature set in the uplink direction and a preset feature set in the downlink direction. When the data packets in the session data stream are detected, only the target preset feature set within the preset feature set is needed, and only part of the data packets in the session data stream are detected. This avoids detecting every data packet in the session data stream, and improves both the accuracy of identifying the session data stream as a data stream corresponding to an email and the efficiency of detecting the session data stream.
Step S23: in response to the accumulated weight value being greater than a preset threshold, determining the session data stream as a data stream corresponding to an email.
The process of obtaining the accumulated weight value of the session data stream is described above. When the session data stream is a data stream corresponding to an email, its accumulated weight value will be greater than the preset threshold; the server responds to the accumulated weight value being greater than the preset threshold and determines the session data stream to be a data stream corresponding to an email. It should be noted that the preset threshold can be adjusted based on the actual situation.
When the session data stream does not correspond to an email, its accumulated weight value is smaller than the preset threshold, and the server responds to the accumulated weight value being smaller than the preset threshold. However, the accumulated weight value may also be below the preset threshold simply because not all of the data packets to be detected have been detected yet, which would make the identification of the session data stream inaccurate. To avoid this, the server judges whether the total number of detected data packets in the session data stream has reached a preset value. When the total number of detected data packets reaches the preset value, all M data packets to be detected have been detected, so the session data stream can be determined to be a data stream not corresponding to an email, and the session data stream is marked and discarded. When the total number of detected data packets is lower than the preset value, not all M data packets to be detected have been detected, and detection continues on the data packets in the session data stream that still need to be detected.
For example, the session data stream has 8 data packets, {A, B, C, D, E, F, G, H}, the number of data packets to be detected is 5, and the weight value calculated after detecting each data packet to be detected is shown in Table 6:
TABLE 6
Table 6 records the data packets to be detected and the weight value calculated after each of them is detected. In Table 6, the weight value associated with each target preset feature in the target preset feature set is 1; the weight values may also take other values, which are not specifically described here.
If the preset threshold is 3: after the first detection, of A, the calculated weight value is 1. Since the number of detections satisfies 1 < 5, the second detection, of C, is carried out, and the accumulated weight value is 1 + 0 = 1. Since 2 < 5, the third detection, of E, is carried out, and the accumulated weight value is 1 + 1 = 2. Since 3 < 5, the fourth detection, of G, is carried out, and the accumulated weight value is 2 + 1 = 3. At this point the accumulated weight value 3 is not lower than the preset threshold, so detection can stop and the session data stream is determined to be a data stream corresponding to an email. If the preset threshold is 5, the above procedure is repeated until the fifth detection, of F, is completed, at which point the accumulated weight value is 4 and all the data packets to be detected have been detected; the session data stream is marked, and the remaining data packets in the session data stream are ignored.
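The worked example above can be replayed with a short sketch. The per-packet weights (A = 1, C = 0, E = 1, G = 1, F = 1) are read off the arithmetic in the example; the function name `classify_session` and the returned labels are hypothetical. Step S23 is worded as "greater than", whereas the example stops when the accumulated value is "not lower than" the threshold; this sketch follows the example and uses `>=`.

```python
def classify_session(packet_weights, threshold, max_packets):
    """Accumulate per-packet weights and decide early where possible.

    Returns (label, packets_detected, accumulated_weight).
    """
    accumulated = 0
    detected = 0
    for weight in packet_weights:
        if detected >= max_packets:
            break
        accumulated += weight
        detected += 1
        if accumulated >= threshold:
            # Threshold reached: stop detecting and report an email stream.
            return "email", detected, accumulated
    if detected >= max_packets:
        # All packets to be detected were examined without reaching the
        # threshold: mark the stream and ignore its remaining packets.
        return "non-email", detected, accumulated
    # Fewer packets seen than the preset value: keep detecting later packets.
    return "undecided", detected, accumulated


# Weights of the five detected packets A, C, E, G, F from the example.
example_weights = [1, 0, 1, 1, 1]

result_threshold_3 = classify_session(example_weights, threshold=3, max_packets=5)
result_threshold_5 = classify_session(example_weights, threshold=5, max_packets=5)
```

With a threshold of 3 the loop stops after the fourth packet; with a threshold of 5 all five packets are examined and the stream is marked as non-email, matching the two cases in the text.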
By the above method, the session data stream is determined to be a data stream corresponding to an email based on the relation between the accumulated weight value and the preset threshold, and because the preset threshold can be adjusted, the accuracy of this determination is higher.
Based on the above description, the session data stream is matched against each preset feature in the preset feature set, and whether the session data stream is a data stream corresponding to an email is judged according to the accumulated weight value obtained from the detection. Because only the data in part of the data packets of a single session data stream is detected, the pressure on the server of storing the session data stream is reduced, the efficiency of detecting the session data stream is improved, and the accuracy of identifying the session data stream as a data stream corresponding to an email is improved.
Based on the same inventive concept, the embodiment of the present application further provides a data identification device, where the data identification device is configured to implement a function of a data identification method, and referring to fig. 3, the device includes:
an obtaining module 301, configured to obtain M data packets of the N data packets in a session data stream;
a matching module 302, configured to match the data in each of the M data packets with each preset feature in a preset feature set one by one, so as to obtain an accumulated weight value corresponding to the session data stream;
and a response module 303, configured to determine the session data stream as a data stream corresponding to an email in response to the accumulated weight value being greater than a preset threshold.
In one possible design, the obtaining module 301 is specifically configured to obtain a command keyword set and a command response keyword set from a preset protocol, associate each command keyword in the command keyword set with a command keyword identifier, generate an uplink feature string set corresponding to the command keyword set, associate each command response keyword in the command response keyword set with a command response identifier, generate a downlink feature string set corresponding to the command response keyword set, and input the uplink feature string set and the downlink feature string set into a preset feature compiling model to obtain a preset feature set.
In one possible design, the matching module 302 is specifically configured to extract a source port number and/or a destination port number corresponding to each data packet in the session data flow one by one, determine, from a preset feature set, a target preset feature set corresponding to each data packet in the session data flow one by one based on the source port number and/or the destination port number, match data in each data packet in the M data packets one by one with preset features in the target preset feature set, obtain weight values of each detected data packet, gradually accumulate the weight values of each detected data packet, and obtain accumulated weight values corresponding to the session data flow.
In one possible design, the response module 303 is specifically configured to, in response to the accumulated weight value being smaller than a preset threshold: when the total number of detected data packets in the session data stream reaches a preset value, mark the session data stream and ignore the remaining undetected data packets in the marked session data stream; and when the total number of detected data packets in the session data stream is smaller than the preset value, continue to detect the data in the undetected data packets in the session data stream.
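The feature-set construction attributed to the obtaining module 301 above can be sketched roughly as follows. The keyword and reply-code lists are a small subset taken from RFC 5321; expressing the "identifier" as a start-of-payload anchor in a regular expression, and the helper names, are assumptions made for illustration. The compile step into the preset feature compiling model is not shown.

```python
import re

# Subset of SMTP commands (client -> server) and reply codes (server -> client).
COMMAND_KEYWORDS = ["HELO", "EHLO", "MAIL FROM:", "RCPT TO:", "DATA", "QUIT"]
RESPONSE_CODES = ["220", "250", "354", "221"]


def build_uplink_patterns(keywords):
    """Anchor each command keyword at the start of the payload (assumed position)."""
    return [rb"^" + re.escape(k.encode("ascii")) for k in keywords]


def build_downlink_patterns(codes):
    """Anchor each reply code at the start, followed by a space or hyphen."""
    return [rb"^" + re.escape(c.encode("ascii")) + rb"[ -]" for c in codes]


uplink_patterns = build_uplink_patterns(COMMAND_KEYWORDS)
downlink_patterns = build_downlink_patterns(RESPONSE_CODES)
```

The two resulting pattern lists play the role of the uplink and downlink feature string sets that would then be handed to the rule compiler.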
Based on the same inventive concept, the embodiment of the present application further provides an electronic device, where the electronic device may implement the function of the foregoing data identifying apparatus, and referring to fig. 4, the electronic device includes:
at least one processor 401, and a memory 402 connected to the at least one processor 401. In this embodiment of the present application, the specific connection medium between the processor 401 and the memory 402 is not limited; in fig. 4, the processor 401 and the memory 402 are connected by a bus 400 as an example. The bus 400 is shown as a bold line in fig. 4; the manner in which the other components are connected is illustrated schematically and is not limiting. The bus 400 may be divided into an address bus, a data bus, a control bus, and so on; it is represented by only one thick line in fig. 4 for ease of illustration, but this does not mean that there is only one bus or one type of bus. Alternatively, the processor 401 may be referred to as a controller; the name is not limited.
In the embodiment of the present application, the memory 402 stores instructions executable by the at least one processor 401, and the at least one processor 401 may perform a data identification method as described above by executing the instructions stored in the memory 402. Processor 401 may implement the functions of the various modules in the apparatus shown in fig. 3.
The processor 401 is the control center of the apparatus and can use various interfaces and lines to connect the various parts of the entire control device. By running or executing the instructions stored in the memory 402 and invoking the data stored in the memory 402, it performs the various functions of the apparatus and processes data, thereby monitoring the apparatus as a whole.
In one possible design, processor 401 may include one or more processing units, and processor 401 may integrate an application processor and a modem processor, wherein the application processor primarily processes operating systems, user interfaces, application programs, and the like, and the modem processor primarily processes wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401. In some embodiments, processor 401 and memory 402 may be implemented on the same chip, and in some embodiments they may be implemented separately on separate chips.
The processor 401 may be a general purpose processor such as a Central Processing Unit (CPU), digital signal processor, application specific integrated circuit, field programmable gate array or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, which may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a data identification method disclosed in connection with the embodiments of the present application may be directly embodied in a hardware processor for execution, or may be executed by a combination of hardware and software modules in the processor.
Memory 402, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 402 may include at least one type of storage medium, for example flash memory, hard disk, multimedia card, card memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, magnetic disk, optical disk, and the like. Memory 402 may be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 402 in the present embodiment may also be circuitry or any other device capable of implementing a memory function, for storing program instructions and/or data.
By programming the processor 401, the code corresponding to the data identification method described in the foregoing embodiments may be burned into the chip, so that the chip can perform the data identification steps of the embodiment shown in fig. 2 at run time. How to design and program the processor 401 is well known to those skilled in the art and is not described in detail here.
Based on the same inventive concept, the embodiments of the present application also provide a storage medium storing computer instructions that, when executed on a computer, cause the computer to perform a data identification method as previously discussed.
In some possible embodiments, aspects of the data identification method provided by the present application may also be implemented in the form of a program product comprising program code which, when the program product is run on an apparatus, causes the control apparatus to carry out the steps of the data identification method according to the various exemplary embodiments of the present application described above.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (8)

1. A method of data identification, comprising:
obtaining M data packets in N data packets in a session data stream, wherein N and M are positive integers, and M is less than or equal to N;
matching the data in each data packet in the M data packets with each preset feature in a preset feature set one by one, and obtaining an accumulated weight value corresponding to the session data stream according to a matching result, wherein the accumulated weight value is an accumulated weight value obtained by adding weight values respectively associated with all preset features, which are respectively matched with the data in each data packet detected in the M data packets;
Determining the session data stream as a data stream corresponding to the electronic mail and the electronic mail in response to the accumulated weight value being greater than a preset threshold;
detecting whether the total number of the detected data packets in the session data stream reaches a preset value or not in response to the accumulated weight value being smaller than a preset threshold value;
if yes, marking the session data stream, and ignoring the undetected residual data packets in the marked session data stream;
if not, continuing to detect the data in the undetected data packet in the session data stream.
2. The method of claim 1, wherein before matching the data in each of the M data packets with each of the predetermined features in the predetermined feature set one by one, comprising:
acquiring a command keyword set and a command response keyword set from a preset protocol, wherein the command keyword set is uplink data sent to a server by a client, and the command response keyword set is downlink data sent to the client by the server;
associating each command keyword in the command keyword set with a command keyword identifier, generating an uplink feature word string set corresponding to the command keyword set, associating each command response keyword in the command response keyword set with a command response identifier, and generating a downlink feature word string set corresponding to the command response keyword set, wherein the identifier comprises a position range of the command keyword or the command response keyword and a character code;
Inputting the uplink characteristic word string set and the downlink characteristic word string set into a preset characteristic compiling model to obtain a preset characteristic set.
3. The method of claim 1, wherein matching the data in each of the M data packets with each preset feature in a preset feature set one by one, and obtaining the accumulated weight value corresponding to the session data stream according to the matching result, comprises:
extracting source port numbers and/or destination port numbers corresponding to all data packets in the session data flow one by one, and determining target preset feature sets corresponding to all data packets in the session data flow one by one from preset feature sets based on the source port numbers and/or the destination port numbers;
matching the data in each data packet in the M data packets with preset features in the target preset feature set one by one to obtain weight values of the detected data packets, wherein the weight values represent the probability that the data in the data packets are data in data streams corresponding to emails;
gradually accumulating the weight values of the detected data packets to obtain accumulated weight values corresponding to the session data flow.
4. A data recognition device, the device comprising:
The acquisition module is used for acquiring M data packets in N data packets in the session data stream;
the matching module is used for matching the data in each data packet in the M data packets with each preset feature in a preset feature set one by one, and obtaining an accumulated weight value corresponding to the session data flow according to a matching result;
and the response module is used for determining the session data stream as a data stream corresponding to an email in response to the accumulated weight value being greater than a preset threshold; and, in response to the accumulated weight value being smaller than the preset threshold, marking the session data stream and ignoring the undetected remaining data packets in the marked session data stream when the total number of detected data packets in the session data stream reaches a preset value, and continuing to detect the data in the undetected data packets in the session data stream when the total number of detected data packets in the session data stream is smaller than the preset value.
5. The apparatus of claim 4, wherein the obtaining module is specifically configured to obtain a command keyword set and a command response keyword set from a preset protocol, associate each command keyword in the command keyword set with a command keyword identifier, generate an uplink feature string set corresponding to the command keyword set, associate each command response keyword in the command response keyword set with a command response identifier, generate a downlink feature string set corresponding to the command response keyword set, and input the uplink feature string set and the downlink feature string set into a preset feature compiling model to obtain a preset feature set.
6. The apparatus of claim 4, wherein the matching module is specifically configured to extract a source port number and/or a destination port number corresponding to each data packet in the session data stream one by one, determine a target preset feature set corresponding to each data packet in the session data stream one by one from a preset feature set based on the source port number and/or the destination port number, match data in each data packet in the M data packets one by one with preset features in the target preset feature set, obtain weight values of each detected data packet, gradually accumulate the weight values of each detected data packet, and obtain accumulated weight values corresponding to the session data stream.
7. An electronic device, comprising:
a memory for storing a computer program;
a processor for carrying out the method steps of any one of claims 1-3 when executing a computer program stored on said memory.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-3.
CN202210597597.7A 2022-05-30 2022-05-30 Data identification method and device and electronic equipment Active CN115037698B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210597597.7A CN115037698B (en) 2022-05-30 2022-05-30 Data identification method and device and electronic equipment
PCT/CN2022/141581 WO2023231391A1 (en) 2022-05-30 2022-12-23 Data identification method and apparatus and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210597597.7A CN115037698B (en) 2022-05-30 2022-05-30 Data identification method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN115037698A CN115037698A (en) 2022-09-09
CN115037698B true CN115037698B (en) 2024-01-02

Family

ID=83122061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210597597.7A Active CN115037698B (en) 2022-05-30 2022-05-30 Data identification method and device and electronic equipment

Country Status (2)

Country Link
CN (1) CN115037698B (en)
WO (1) WO2023231391A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115037698B (en) * 2022-05-30 2024-01-02 天翼云科技有限公司 Data identification method and device and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101039226A (en) * 2007-03-13 2007-09-19 杭州华三通信技术有限公司 Device and method for recognizing point-to-point application
CN101741908A (en) * 2009-12-25 2010-06-16 青岛朗讯科技通讯设备有限公司 Identification method for application layer protocol characteristic
US8112484B1 (en) * 2006-05-31 2012-02-07 Proofpoint, Inc. Apparatus and method for auxiliary classification for generating features for a spam filtering model
CN102833263A (en) * 2012-09-07 2012-12-19 北京神州绿盟信息安全科技股份有限公司 Method and device for intrusion detection and intrusion protection
CN112511457A (en) * 2019-09-16 2021-03-16 华为技术有限公司 Data stream type identification method and related equipment
CN113630418A (en) * 2021-08-16 2021-11-09 杭州安恒信息安全技术有限公司 Network service identification method, device, equipment and medium
CN114070800A (en) * 2021-10-29 2022-02-18 复旦大学 SECS2 traffic rapid identification method combining deep packet inspection and deep stream inspection

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8332366B2 (en) * 2006-06-02 2012-12-11 International Business Machines Corporation System and method for automatic weight generation for probabilistic matching
CN102523223B (en) * 2011-12-20 2014-08-27 北京神州绿盟信息安全科技股份有限公司 Trojan detection method and apparatus thereof
CN106709346B (en) * 2016-11-25 2019-08-06 腾讯科技(深圳)有限公司 Document handling method and device
CN109450740A (en) * 2018-12-21 2019-03-08 青岛理工大学 A kind of SDN controller carrying out traffic classification based on DPI and machine learning algorithm
CN111404768A (en) * 2019-01-02 2020-07-10 中国移动通信有限公司研究院 DPI recognition realization method and equipment
US11882138B2 (en) * 2020-06-18 2024-01-23 International Business Machines Corporation Fast identification of offense and attack execution in network traffic patterns
CN115037698B (en) * 2022-05-30 2024-01-02 天翼云科技有限公司 Data identification method and device and electronic equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ClassBench:A Packet Classification Benchmark;David E.Taylor;IEEE;全文 *
J. Klensin ; .Simple Mail Transfer Protocol.IETF rfc5321.2008,全文. *
基于多核处理和DPI的网络流量监控系统的设计与实现;王化铎;《中国优秀硕士学位论文全文数据库》;全文 *
基于行为特征加权的P2P流识别方法的研究;崔燕;汪斌强;陈庶樵;张震;;计算机工程与设计(20);正文第2章节 *

Also Published As

Publication number Publication date
CN115037698A (en) 2022-09-09
WO2023231391A1 (en) 2023-12-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant