CN114389872A - Data processing method, model training method, electronic device, and storage medium - Google Patents

Info

Publication number
CN114389872A
Authority
CN
China
Prior art keywords
data
sender
characteristic value
sending
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111681609.6A
Other languages
Chinese (zh)
Inventor
李涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Original Assignee
Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuo Erzhi Lian Wuhan Research Institute Co Ltd
Priority to CN202111681609.6A
Publication of CN114389872A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425: Traffic logging, e.g. anomaly detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G06N20/10: Machine learning using kernel methods, e.g. support vector machines [SVM]

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

An embodiment of the application discloses a data processing method comprising: acquiring a first data transceiving condition of a sender of data to be classified within a preset time period; determining a first characteristic value of the sender based on a first data volume sent by the sender at a first Internet Protocol (IP) address and a second data volume sent from the first IP address, both indicated by the first data transceiving condition, where the first IP address is the IP address from which the data to be classified is sent; determining a second characteristic value of the sender based on the distribution of the sender's data sending moments indicated by the first data transceiving condition; and determining the category of the sender based on the first characteristic value, the second characteristic value, and a target classification model, where data from senders of different categories is processed in different ways. The first characteristic value represents the sender's sending behavior relative to all data sent from the IP address, and the second characteristic value represents the timing of the sender's transmissions; together they improve the accuracy with which the sender's category is judged.

Description

Data processing method, model training method, electronic device, and storage medium
Technical Field
The present invention relates to the field of deep learning, and in particular, to a data processing method, a model training method, an electronic device, and a storage medium.
Background
In the related art, received data is screened, for example to detect whether a received email is spam, generally by analyzing only log information about the host behavior and mailbox behavior of the sent email. Abnormal flows in the email protocol traffic are identified from the analysis of the sending behavior, and whether the received email is spam is then decided from those abnormal flows. However, the features available to this kind of analysis cover only a limited number of dimensions, so detection accuracy is insufficient.
Disclosure of Invention
In view of this, embodiments of the present invention provide a data processing method, a model training method, an electronic device, and a storage medium.
The technical scheme of the invention is realized as follows:
In a first aspect, an embodiment of the present invention provides a data processing method, where the method includes:
acquiring a first data transceiving condition of a sender of data to be classified within a preset time period;
determining a first characteristic value of the sender based on a first data volume sent by the sender at a first Internet Protocol (IP) address and a second data volume sent from the first IP address, both indicated by the first data transceiving condition; the first IP address is the IP address from which the data to be classified is sent;
determining a second characteristic value of the sender based on the distribution of the sender's data sending moments indicated by the first data transceiving condition;
determining a category of the sender based on the first characteristic value, the second characteristic value and a target classification model; wherein data from senders of different categories is processed in different ways.
Further, determining the second characteristic value of the sender based on the distribution of the sender's data sending moments indicated by the first data transceiving condition includes:
determining the time interval between every two adjacent data sending moments based on the distribution of the sender's data sending moments indicated by the first data transceiving condition;
and determining the second characteristic value of the sender based on the distribution of the time intervals.
Further, the determining a second characteristic value of the sender based on the distribution of the time intervals includes:
determining a third data volume sent at the data sending time after each time interval and a fourth data volume sent by the sender in the preset time period based on the distribution condition of the time intervals;
calculating a ratio of the third data amount to the fourth data amount;
and determining a second characteristic value of the sender based on the ratio and the natural logarithm of the ratio.
Further, the method further comprises:
constructing a weighted directed graph based on the associated parties that, as indicated by the first data transceiving condition, have a data transceiving relation with the sender; each node in the weighted directed graph represents the sender or one of the associated parties;
generating a third characteristic value set based on the node information and/or the edge weights in the weighted directed graph; an edge represents a data transceiving relation between two nodes; the weight of an edge represents a fifth data volume sent, within the preset time period, from the edge's starting node to the node the edge points to;
the determining the category of the sender based on the first feature value, the second feature value and the target classification model includes:
and inputting the first characteristic value, the second characteristic value and the third characteristic value set into a target classification model, and determining the category of the sender.
Further, generating the third characteristic value set based on the node information and/or the edge weights in the weighted directed graph includes:
calculating a fourth characteristic value based on a first number of edges on which the sender sends data and a second number of edges on which the sender receives data;
calculating a fifth characteristic value based on the weights of the edges on which the sender sends data;
calculating a sixth characteristic value based on the weights of the edges connected to the nodes of the associated parties that share the sender's IP address;
generating the third characteristic value set based on the fourth, fifth and sixth characteristic values.
Further, the calculating a fifth eigenvalue based on the weight characterizing the edge of the data sent by the sender includes:
and calculating the weight average value of all edges representing the data sent by the sender in the weighted directed graph as a fifth characteristic value.
In a second aspect, an embodiment of the present invention provides a model training method, where the method includes:
acquiring a second data receiving and sending condition of a sender of the sample data in a preset time period;
determining a seventh characteristic value of the sender based on the data volume sent by the sender on the second IP address and the data volume sent by the second IP address indicated by the second data transceiving condition; the second IP address is an IP address for sending the sample data;
determining an eighth characteristic value of the sender based on the distribution situation of the plurality of data sending moments of the sender indicated by the second data sending and receiving situation;
training a preset classification model based on the seventh characteristic value and the eighth characteristic value to obtain a classification value of the preset classification model;
determining a training loss value according to the difference between the classification value and a label of a sender of the sample data;
stopping training the preset classification model when the training loss value meets a preset condition;
and when the training loss value does not meet the preset condition, continuing to train the classification model.
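The train, measure loss, stop-or-continue cycle in the steps above can be illustrated with a minimal sketch. It is not the patent's model: it assumes a plain linear classifier trained with hinge loss as a stand-in for the preset classification model, labels of +1 (abnormal) and -1 (normal), a mean-loss threshold as the preset condition, and made-up sample values for the seventh and eighth characteristic values. All function and parameter names are hypothetical.

```python
def hinge_loss(w, b, samples, labels):
    """Mean hinge loss between the model's scores and the sender labels."""
    total = 0.0
    for x, y in zip(samples, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        total += max(0.0, 1.0 - y * score)
    return total / len(samples)


def train(samples, labels, lr=0.1, loss_threshold=0.05, max_epochs=1000):
    """Train until the loss meets the preset condition (assumed: loss <= threshold)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    loss = hinge_loss(w, b, samples, labels)
    for _ in range(max_epochs):
        if loss <= loss_threshold:
            break  # preset condition met: stop training
        for x, y in zip(samples, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score < 1.0:  # sample inside the margin: adjust the model
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
        loss = hinge_loss(w, b, samples, labels)  # otherwise keep training
    return w, b, loss


# Illustrative samples: [seventh value, eighth value]; labels are assumptions.
samples = [[0.05, 0.2], [0.1, 0.3], [0.6, 1.4], [0.5, 1.2]]
labels = [1, 1, -1, -1]
w, b, final_loss = train(samples, labels)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
               for x in samples]
```

The loss is evaluated in a separate pass after each round of updates, so the stopping decision always refers to the model as it currently stands.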
Further, determining the eighth characteristic value of the sender based on the distribution of the sender's data sending moments indicated by the second data transceiving condition includes:
determining the time interval between every two adjacent data sending moments based on the distribution of the sender's data sending moments indicated by the second data transceiving condition;
and determining the eighth characteristic value of the sender based on the distribution of the time intervals.
In a third aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes: a processor and a memory for storing a computer program capable of running on the processor;
the processor, when running the computer program, performs the steps of the method in one or more of the foregoing technical solutions.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions; the computer-executable instructions, when executed by a processor, implement the method described in one or more of the foregoing technical solutions.
The data processing method provided by the invention comprises the following steps: acquiring a first data transceiving condition of a sender of data to be classified within a preset time period; determining a first characteristic value of the sender based on a first data volume sent by the sender at a first Internet Protocol (IP) address and a second data volume sent from the first IP address, both indicated by the first data transceiving condition, where the first IP address is the IP address from which the data to be classified is sent; determining a second characteristic value of the sender based on the distribution of the sender's data sending moments indicated by the first data transceiving condition; and determining a category of the sender based on the first characteristic value, the second characteristic value and a target classification model, where data from senders of different categories is processed in different ways. The first characteristic value expresses the share of the data sent from the sender's current IP address that is attributable to the sender, and so represents the behavior characteristic of the sender's transmissions within the preset time period; the second characteristic value represents the time characteristic of those transmissions. Processing the first and second characteristic values with the classification model therefore allows the category of the sender to be judged from both the behavior characteristic and the time characteristic. This matches the way senders actually send data, enriches the dimensions of the characteristic values, improves the accuracy of the category judgment, and in turn improves the accuracy with which the data processing mode is selected.
Drawings
Fig. 1 is a schematic flow chart of a data processing method according to an embodiment of the present invention;
Fig. 2 is a schematic flow chart of a model training method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;
Fig. 4 is a schematic flow chart of a mail detection method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first", "second", "third" and so on merely distinguish similar objects and do not denote a particular order. Where permitted, they may be interchanged in specific order or sequence, so that the embodiments of the invention described herein can be practiced in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
As shown in fig. 1, an embodiment of the present invention provides a data processing method, where the method includes:
S110: acquiring a first data transceiving condition of a sender of data to be classified within a preset time period;
S120: determining a first characteristic value of the sender based on a first data volume sent by the sender at a first Internet Protocol (IP) address and a second data volume sent from the first IP address, both indicated by the first data transceiving condition; the first IP address is the IP address from which the data to be classified is sent;
S130: determining a second characteristic value of the sender based on the distribution of the sender's data sending moments indicated by the first data transceiving condition;
S140: determining a category of the sender based on the first characteristic value, the second characteristic value and a target classification model; wherein data from senders of different categories is processed in different ways.
In the embodiment of the present invention, the data to be classified may be any received data that needs to be processed, for example a mail or a short message. Taking a mail as an example, the data volume may be the number of mails sent, and the category of the mail's sender may be a normal category or an abnormal category. Accordingly, a mail sent by a sender of the normal category may be a normal mail, and a mail sent by a sender of the abnormal category may be a spam mail. Whether a received mail is spam can therefore be identified accurately based on the first characteristic value and the second characteristic value.
Here, the preset time period is a period set in advance for determining the data transceiving condition. It may be, for example, the one hour, two hours, or one day before the current time or before the time at which the data to be classified was received. The first data transceiving condition of the sender can be obtained from the sender's data transceiving log information, traffic information and the like, and indicates how the sender sent and received data within the preset time period.
In one embodiment, the Internet Protocol (IP) address from which the sender sent the current data to be classified is determined as the first IP address, so that the first data volume, which is the total sent by the current sender at the first IP address, and the second data volume, which is the total sent by all the different senders at the first IP address, can be determined from the first data transceiving condition.
In another embodiment, the first data volume is $out_m = |V_{ik}|$, where $V_{ik}$ characterizes the amount of data sent by sender $i$ at the first IP address $k$, and the second data volume is
$out_{IP} = \sum_{i=1}^{n} |V_{ik}|$,
which represents the total amount of data sent by the $n$ senders on the first IP address $k$. Here, $n$ may be the number of senders who have sent data on the first IP address $k$. The first characteristic value may be calculated by weighting the first data volume and the second data volume, or may be the ratio of the first data volume to the second data volume, for example $F_{m\_IP} = out_m / out_{IP}$. The first characteristic value thus represents the sender's share of the traffic on the first IP address. Taking a mail as an example, for a spammer who continuously changes the mailbox address, that is, the sender identity, the value of $F_{m\_IP}$ approaches 0, whereas for normal senders the values of $F_{m\_IP}$ are distributed relatively uniformly.
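As a minimal sketch of the ratio form of the first characteristic value, the snippet below computes the sender's share of everything sent from one IP address. The log format (sender, IP address, amount) and the function name are assumptions made for illustration; the source does not specify how the data transceiving condition is stored.

```python
from collections import defaultdict


def first_feature(send_log, sender, ip):
    """Return out_m / out_IP for `sender` on `ip`.

    send_log: iterable of (sender_id, ip_address, amount) records observed
    within the preset time period (a hypothetical log format).
    """
    per_sender = defaultdict(int)
    for s, addr, amount in send_log:
        if addr == ip:
            per_sender[s] += amount
    out_ip = sum(per_sender.values())  # second data volume: total sent from this IP
    if out_ip == 0:
        return 0.0  # nothing was sent from this IP in the period
    return per_sender[sender] / out_ip  # first data volume / second data volume


log = [
    ("a@x.com", "10.0.0.1", 3),
    ("b@x.com", "10.0.0.1", 1),
    ("a@x.com", "10.0.0.2", 5),
]
share = first_feature(log, "a@x.com", "10.0.0.1")
```

Only records from the queried IP address contribute, so the mails the same sender sent from `10.0.0.2` do not affect the result.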
In one embodiment, S130 may include: determining, according to the first data transceiving condition, the distribution of all moments at which the sender sent data within the preset time period, and determining the second characteristic value based on that distribution. Here, the distribution of the data sending moments may be the distribution of the time intervals between successive sending moments within the preset time period, the degree of dispersion of all sending moments within the preset time period, or the distribution obtained by dividing the preset time period into a certain number of sub-periods and counting the sending moments that fall in each sub-period, and so on.
In another embodiment, the target classification model classifies the sender of the data according to the first characteristic value and the second characteristic value; for example, it may be a model built with a machine learning algorithm such as a Support Vector Machine (SVM). The data label output by the target classification model characterizes the category of the sender. For example, the label may be 0 or 1 for a binary classification result, where 0 denotes a sender of the normal category and 1 a sender of the abnormal category, or 1, 2, 3, … for a multi-class result, each value denoting a different category.
In yet another embodiment, data sent by senders of different categories may be processed in different ways. Taking a mail as an example, when the sender is of the normal category, the mail can be determined to be a normal mail and can be displayed or stored in a group of mails to be processed; when the sender is of the abnormal category, the mail can be determined to be spam and can be deleted or saved in a spam group.
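The category-dependent handling described above amounts to a simple dispatch on the label. The sketch below assumes the binary labels mentioned earlier (0 for normal, 1 for abnormal) and uses plain lists as hypothetical stand-ins for the pending-mail group and the spam group.

```python
NORMAL, ABNORMAL = 0, 1  # assumed binary labels from the classification model


def route_mail(category, mail, pending_group, spam_group):
    """Store the mail in the group that matches the sender's category."""
    if category == NORMAL:
        pending_group.append(mail)  # normal sender: keep for display/processing
    else:
        spam_group.append(mail)     # abnormal sender: treat as spam


pending, spam = [], []
route_mail(NORMAL, "hello", pending, spam)
route_mail(ABNORMAL, "buy now", pending, spam)
```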
Combining the first characteristic value and the second characteristic value therefore characterizes the sender's data sending behavior comprehensively, covering both its behavior characteristic (the sender's data volume compared with that of the IP address it is currently using) and its time characteristic (the distribution of its data sending moments). The method thus fits the sender's actual sending pattern better and enriches the dimensions and content of the characteristic values, which improves the accuracy of judging the sender's category and allows the data processing mode to be selected more accurately.
In some embodiments, S130 may include:
determining the time interval between every two adjacent data sending moments based on the distribution of the sender's data sending moments indicated by the first data transceiving condition;
and determining the second characteristic value of the sender based on the distribution of the time intervals.
In the embodiment of the present invention, taking a mail as an example, a spam sender, unlike a normal mail sender, needs to send spam continuously to different mail addresses in order to reach as many recipients as possible. Most of a spammer's mails are therefore sent at short intervals: the interval distribution has large values in the short-interval range and small values elsewhere. The sending intervals of a normal sender, by contrast, are distributed randomly and rarely concentrate in one range.
In one embodiment, a time interval distribution histogram of the data sending moments may be determined according to the first data transceiving condition. For example, for the moments $t_{i,j}$, $j = 1, 2, \dots, n$, at which sender $i$ sends data within the preset time period, the corresponding time intervals are $d_{i,j} = t_{i,j+1} - t_{i,j}$. The time interval distribution histogram can then be determined from the $d_{i,j}$, and the second characteristic value can be determined from the distribution characteristics of the histogram.
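A minimal sketch of the interval computation and histogram follows. It assumes timestamps in seconds and fixed-width bins; the bin width and function names are illustrative choices, not taken from the source.

```python
from collections import Counter


def send_intervals(timestamps):
    """Return d_j = t_{j+1} - t_j for consecutive sending moments."""
    ts = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]


def interval_histogram(intervals, bin_width):
    """Count intervals per bin; bin b covers [b * bin_width, (b + 1) * bin_width)."""
    return Counter(int(d // bin_width) for d in intervals)


times = [0, 5, 10, 15, 100, 400]           # sending moments in seconds (made up)
intervals = send_intervals(times)
hist = interval_histogram(intervals, 60)   # one-minute bins
```

With `n` sending moments there are `n - 1` intervals, so the histogram here is built from five values.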
In another embodiment, determining the second characteristic value of the sender based on the distribution of the time intervals may include determining the second characteristic value based on the number of time intervals smaller than a preset threshold, or based on how uniformly the time intervals are distributed. For example, the second characteristic value may be determined from the proportion of time intervals shorter than 1 min among all time intervals within the preset time period.
Therefore, based on the time interval distribution condition of data sending, the time sequence rule of the sender in data sending can be better embodied, and the category of the sender can be more accurately distinguished.
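The threshold-based variant mentioned above can be sketched as the fraction of intervals shorter than a cutoff. The 60-second default mirrors the 1 min example in the text; the function name is hypothetical.

```python
def short_interval_ratio(intervals, threshold=60.0):
    """Fraction of sending intervals shorter than `threshold` seconds."""
    if not intervals:
        return 0.0  # fewer than two sends: no intervals to measure
    short = sum(1 for d in intervals if d < threshold)
    return short / len(intervals)


ratio = short_interval_ratio([5, 5, 5, 85, 300])
```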
In some embodiments, the determining a second characteristic value of the sender based on the distribution of the time intervals includes:
determining a third data volume sent at the data sending time after each time interval and a fourth data volume sent by the sender in the preset time period based on the distribution condition of the time intervals;
calculating a ratio of the third data amount to the fourth data amount;
and determining a second characteristic value of the sender based on the ratio and the natural logarithm of the ratio.
In the embodiment of the invention, after the time intervals are determined, each time interval is associated with its corresponding third data volume, and the proportion of data sent after each time interval is then determined from the third data volume and the fourth data volume, which is the total sent within the preset time period.
In one embodiment, the third data volume sent at the data sending moment after each time interval may be the data volume sent at the one or more data sending moments that follow the same time interval. For example, for the time intervals $d_{i,j} = t_{i,j+1} - t_{i,j}$, the sum of the data volumes sent at all data sending moments whose preceding interval $d_{i,j}$ equals $D$ is the third data volume for $D$; that is, the third data volume represents how the sent data volume is distributed over the different time intervals.
In another embodiment, $p_D = P(x = D)$ denotes the ratio of the third data volume to the fourth data volume, that is, the ratio of the data volume sent at time interval $D$ to the total data volume sent. Determining the second characteristic value of the sender based on the ratio and the natural logarithm of the ratio may then include using $F_t = -\sum_D p_D \ln(p_D)$ to characterize the time characteristic of the distribution of data sending intervals, that is, the second characteristic value. Taking a mail as an example, a spam sender has an uneven time interval distribution, so $F_t$ is relatively small, whereas for a normal mail sender $F_t$ is relatively large. On this basis the category of the sender can be distinguished clearly and accurately.
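The entropy-style formula $F_t = -\sum_D p_D \ln(p_D)$ can be illustrated with a minimal sketch. It assumes that intervals are grouped into fixed-width bins (so that "the same interval" means "the same bin"), that each send carries an amount, and that the fourth data volume is the sum of all amounts in the period; these choices and all names are assumptions, since the source does not fix them.

```python
import math
from collections import defaultdict


def time_feature(timestamps, amounts, bin_width=60.0):
    """Return F_t = -sum(p_D * ln(p_D)) over interval bins D."""
    events = sorted(zip(timestamps, amounts))
    total = sum(a for _, a in events)  # fourth data volume
    if total == 0:
        return 0.0
    third = defaultdict(float)         # third data volume per interval bin
    for (t0, _), (t1, a1) in zip(events, events[1:]):
        third[int((t1 - t0) // bin_width)] += a1
    f_t = 0.0
    for volume in third.values():
        p = volume / total             # p_D: third / fourth data volume
        if p > 0:
            f_t -= p * math.log(p)
    return f_t


# Bursty sender: every interval falls in the same bin.
bursty = time_feature([0, 1, 2, 3, 4], [1, 1, 1, 1, 1])
# Spread-out sender: every interval falls in a different bin.
spread = time_feature([0, 30, 120, 300, 600], [1, 1, 1, 1, 1])
```

In this example the bursty sender yields the smaller value, consistent with the text's observation that concentrated interval distributions give a small $F_t$.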
In some embodiments, the method further comprises:
constructing a weighted directed graph based on the associated parties that, as indicated by the first data transceiving condition, have a data transceiving relation with the sender; each node in the weighted directed graph represents the sender or one of the associated parties;
generating a third characteristic value set based on the node information and/or the edge weights in the weighted directed graph; an edge represents a data transceiving relation between two nodes; the weight of an edge represents a fifth data volume sent, within the preset time period, from the edge's starting node to the node the edge points to;
the determining the category of the sender based on the first feature value, the second feature value and the target classification model includes:
and inputting the first characteristic value, the second characteristic value and the third characteristic value set into a target classification model, and determining the category of the sender.
In the embodiment of the present invention, the weighted directed graph may include a plurality of nodes and edges that each connect two nodes. One node represents the sender or one associated party. Edges are directed, and the weight of an edge represents the amount of data sent along the edge's direction from its starting node to the node it points to. For example, two edges may connect node A and node B, where the weight of one edge represents the amount of data A sends to B, and the weight of the other represents the amount of data B sends to A.
In this way, the data transceiving behavior of the sender can be more clearly shown through the construction of the weighted directed graph. The third eigenvalue set generated based on the node information and/or the edge weight in the weighted directed graph can more accurately and finely characterize the data transmission condition of the sender and the associated party having a data transceiving relation with the sender, and further can better embody the data transmission behavior characteristic of the sender.
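A minimal sketch of building such a graph from send records is shown below. The record format (source, destination, amount) and the nested-dictionary representation are assumptions for illustration only.

```python
from collections import defaultdict


def build_graph(records):
    """Build graph[src][dst] = total amount sent from src to dst.

    records: iterable of (src, dst, amount) tuples observed within the
    preset time period (a hypothetical format).
    """
    graph = defaultdict(lambda: defaultdict(int))
    for src, dst, amount in records:
        graph[src][dst] += amount  # repeated sends accumulate on one edge
    return graph


records = [("A", "B", 2), ("A", "B", 1), ("B", "A", 4), ("A", "C", 5)]
g = build_graph(records)
```

Here the edge A→B and the edge B→A are stored separately, matching the two-edge example in the text.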
In some embodiments, generating the third characteristic value set based on the node information and/or the edge weights in the weighted directed graph includes:
calculating a fourth characteristic value based on a first number of edges on which the sender sends data and a second number of edges on which the sender receives data;
calculating a fifth characteristic value based on the weights of the edges on which the sender sends data;
calculating a sixth characteristic value based on the weights of the edges connected to the nodes of the associated parties that share the sender's IP address;
generating the third characteristic value set based on the fourth, fifth and sixth characteristic values.
In this embodiment of the present invention, the first number may be the number of outgoing edges connected to one node in the weighted directed graph, that is, the number of edges pointing to other nodes from one node, and the second number may be the number of incoming edges connected to one node, that is, the number of edges pointing to the node from other nodes. The fourth characteristic value may be a ratio of the first quantity to the second quantity, or may be calculated by the first quantity and the second quantity according to a certain weight.
In one embodiment, $|v_i|$ denotes the number of edges pointing from node $i$ to other nodes, that is, the first number, and $|e_i|$ denotes the number of edges pointing from other nodes to node $i$, that is, the second number. The fourth characteristic value is calculated as $|e_i| / |v_i|$ and represents the ratio of the edges on which node $i$ receives data to the edges on which it sends data. The fourth characteristic value can thus represent the balance between the sender's sending and receiving. Taking a mail as an example, a spam sender has a poorly balanced send/receive pattern: the number of edges on which it sends data may be much larger than the number on which it receives data.
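The ratio $|e_i| / |v_i|$ can be sketched as an in-degree to out-degree ratio. The edge representation (a dictionary keyed by (source, destination) pairs) and the decision to raise an error when a node has no outgoing edges are assumptions of this sketch.

```python
def degree_ratio(edges, node):
    """Return |e_i| / |v_i|: incoming edge count over outgoing edge count.

    edges: dict mapping (src, dst) -> weight (a hypothetical representation).
    """
    out_edges = sum(1 for (src, _) in edges if src == node)  # |v_i|
    in_edges = sum(1 for (_, dst) in edges if dst == node)   # |e_i|
    if out_edges == 0:
        raise ValueError("node has no outgoing edges; ratio is undefined")
    return in_edges / out_edges


edges = {
    ("spam", "u1"): 1, ("spam", "u2"): 1, ("spam", "u3"): 1, ("spam", "u4"): 1,
    ("u1", "spam"): 1,
}
ratio = degree_ratio(edges, "spam")
```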
In another embodiment, the fifth characteristic value is calculated based on the weights of the outgoing edges of the sender's node. It may be the average of the weights of all outgoing edges of the node, or a weighted combination in which each outgoing edge's weight is multiplied by a different coefficient. The fifth characteristic value thus reflects the overall or average level of the data volume the sender sends to other nodes, which helps in analyzing the category of the sender.
In yet another embodiment, the sixth characteristic value is calculated based on the weights of the edges connected with the nodes corresponding to the associator whose sender belongs to the same IP address, and the sum of the outgoing weights of all nodes located at the same IP address k, that is, the sum of the data amounts sent by all nodes including the sender node and the associator section at the first IP address, may be the sixth characteristic value. Therefore, the data sending condition of the first IP address can be better reflected, and the category of the sender and the characteristics of the current data to be classified from the IP address are further judged.
Therefore, the third characteristic value set is described based on other characteristic values of multiple layers, and the characteristic value dimensionality in the input target classification model is more comprehensively enriched, so that the method is more favorable for accurately distinguishing the sender category.
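The three graph-based characteristic values described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function name `graph_features`, the `edges` dictionary keyed by (source, destination), and the `ip_of` mapping are hypothetical and do not appear in the original text. The fourth value follows the $|e_i|/|v_i|$ form and the fifth uses the plain average of outgoing edge weights.

```python
def graph_features(edges, sender, ip_of):
    """Return (fourth, fifth, sixth) characteristic values for `sender`.

    edges: dict {(src, dst): weight}, a hypothetical weighted directed graph.
    ip_of: dict mapping each node to its IP address (assumed to be known).
    """
    out_weights = [w for (src, _dst), w in edges.items() if src == sender]
    in_count = sum(1 for (_src, dst) in edges if dst == sender)
    out_count = len(out_weights)

    # Fourth value: number of incoming edges divided by number of outgoing edges.
    fourth = in_count / out_count if out_count else 0.0
    # Fifth value: mean weight of the sender's outgoing edges.
    fifth = sum(out_weights) / out_count if out_count else 0.0
    # Sixth value: total outgoing weight of every node sharing the sender's IP.
    ip = ip_of[sender]
    sixth = sum(w for (src, _dst), w in edges.items() if ip_of.get(src) == ip)
    return fourth, fifth, sixth


edges = {("a", "b"): 4, ("a", "c"): 2, ("b", "a"): 1, ("d", "c"): 3}
ip_of = {"a": "192.0.2.1", "d": "192.0.2.1", "b": "192.0.2.2", "c": "192.0.2.3"}
features = graph_features(edges, "a", ip_of)
```

Returning 0.0 when a node has no outgoing edges is a choice made for this sketch; the original text does not say how that case is handled.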
In some embodiments, the calculating a fifth eigenvalue based on the weight characterizing the edge of the data sent by the sender includes:
and calculating the weight average value of all edges representing the data sent by the sender in the weighted directed graph as a fifth characteristic value.
Here, the first moment of the weights of all outgoing edges of the sender node may be calculated; that is, the average weight, obtained as the ratio of the sum of the weights of all outgoing edges to the number of outgoing edges, is used as the fifth characteristic value.
As shown in fig. 2, an embodiment of the present invention provides a model training method, where the method includes:
S210: acquiring a second data receiving and sending condition of a sender of the sample data in a preset time period;
S220: determining a seventh characteristic value of the sender based on the data volume sent by the sender on the second IP address and the data volume sent by the second IP address indicated by the second data transceiving condition; the second IP address is an IP address for sending the sample data;
S230: determining an eighth characteristic value of the sender based on the distribution situation of the plurality of data sending moments of the sender indicated by the second data sending and receiving situation;
S240: training a preset classification model based on the seventh characteristic value and the eighth characteristic value to obtain a classification value of the preset classification model;
S250: determining a training loss value according to the difference between the classification value and a label of a sender of the sample data;
S260: stopping training the preset classification model when the training loss value meets a preset condition;
S270: and when the training loss value does not meet the preset condition, continuing to train the classification model.
In embodiments of the present invention, the sample data may be data that was received in the past and whose sender has already been classified, e.g., mail that was received in the past and has been determined to be normal mail or spam. The second data transceiving condition may be determined from the data transceiving records associated with the classified sample data, or the corresponding data transceiving condition may be obtained directly from the sender, for example from log information and traffic information.
Here, the preset period in the model training may be the same as the preset period in the data processing, or may be within one hour, two hours, one day, or the like before the sample data reception time.
In one embodiment, the label of the sender of the sample data may characterize the category of the sender. For example, for binary classification the labels 0 and 1 may be used, where 0 characterizes the sender as belonging to a normal category and 1 characterizes the sender as belonging to an abnormal category; for multi-class classification the labels 1, 2, and 3 may be used to characterize different categories, and so on. Training the preset classification model based on the seventh characteristic value and the eighth characteristic value may consist of inputting the seventh characteristic value, the eighth characteristic value, and the label into the preset classification model for training, to obtain a classification value of the preset classification model.
In another embodiment, the training loss value may be determined according to a difference between the classification value and the class indicated by the label, the preset condition may be that the training loss value is lower than a preset value, and when the preset condition is satisfied, it may be considered that the classification accuracy of the preset classification model is higher, and the training may be stopped.
Therefore, training the target classification model on the seventh characteristic value, the eighth characteristic value, and the label characterizing the category of the sender of the sample data realizes supervised learning of the target classification model. On this basis, the training progress is determined from the classification value and the training loss value, so that the target classification model learns the association between the input characteristic values and the output label characterizing the sender category, and can then be used to determine the sender category of data to be classified.
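The stop-or-continue logic of steps S240 to S270 can be sketched as follows. This is a minimal illustration, not the classifier of the embodiment: it assumes a plain logistic-regression model trained by gradient descent, and the names `train`, `loss_threshold`, `lr`, and `max_epochs`, as well as the sample values, are hypothetical. The preset condition is taken to be "loss below a preset value", as described above.

```python
import math


def train(samples, labels, loss_threshold=0.1, lr=0.5, max_epochs=20000):
    """Train until the mean cross-entropy loss drops below `loss_threshold`."""
    w = [0.0] * len(samples[0])
    b = 0.0
    loss = float("inf")
    for _epoch in range(max_epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        loss = 0.0
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))        # classification value
            p = min(max(p, 1e-12), 1 - 1e-12)     # keep log() finite
            loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
            for k, xi in enumerate(x):
                grad_w[k] += (p - y) * xi
            grad_b += p - y
        loss /= len(samples)
        if loss < loss_threshold:                 # preset condition met: stop
            break
        # Preset condition not met: continue training with another update.
        w = [wi - lr * g / len(samples) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(samples)
    return w, b, loss


# Each sample is (seventh value, eighth value); label 1 = abnormal, 0 = normal.
samples = [[0.05, 0.3], [0.10, 0.5], [0.90, 2.0], [0.80, 1.8]]
labels = [1, 1, 0, 0]
weights, bias, final_loss = train(samples, labels)
```

The `max_epochs` cap is an extra safeguard added for the sketch so that the loop always terminates; the original text only describes the loss-based condition.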
In some embodiments, the S230 may include:
determining a time interval between every two adjacent data sending moments based on the distribution condition of the plurality of data sending moments of the sending party indicated by the second data sending and receiving condition;
and determining an eighth characteristic value of the sender based on the distribution situation of the time intervals.
In the embodiment of the present invention, a distribution histogram of the time intervals between data sending moments may be determined according to the second data transceiving condition. For example, if the moments at which sender i sends n pieces of data within the preset period are $T_{i,j}$, $j = 1, 2, \ldots, n$, the corresponding time intervals are $d'_{i,j} = T_{i,j+1} - T_{i,j}$. A time interval distribution histogram may then be determined from $d'_{i,j}$, and the eighth characteristic value may be determined based on the distribution features of the histogram.
In one embodiment, the eighth characteristic value of the sender may be determined based on the number of time intervals smaller than a preset threshold, or based on how uniformly the time intervals are distributed. For example, the eighth characteristic value may be determined based on the proportion of time intervals shorter than 1 min among all time intervals within the preset period.
Therefore, based on the time interval distribution condition of data sending, the time sequence rule of the sample data sender when sending data can be better embodied, and the category of the sender can be more accurately distinguished.
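As an illustration of the threshold-based variant described above, the following minimal Python sketch computes the proportion of adjacent sending intervals that are shorter than a threshold. The function name `short_interval_ratio` and the use of seconds as the time unit are assumptions made for the example.

```python
def short_interval_ratio(send_times, threshold=60.0):
    """Share of intervals between adjacent sending moments below `threshold`.

    send_times: sending moments in seconds (assumed unit); order does not matter.
    """
    times = sorted(send_times)
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    if not intervals:
        return 0.0  # fewer than two sending moments: no interval to measure
    return sum(1 for d in intervals if d < threshold) / len(intervals)


ratio = short_interval_ratio([0, 10, 25, 400, 430])
```

Here the intervals are 10, 15, 375, and 30 seconds, so three of the four intervals fall below the one-minute threshold.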
In some embodiments, the determining an eighth characteristic value of the sender based on the distribution of the time intervals includes:
determining a sixth data volume sent at the data sending time after each time interval and a seventh data volume sent by the sender in the preset time period based on the distribution condition of the time intervals;
calculating a ratio of the sixth data amount to the seventh data amount;
and determining an eighth characteristic value of the sender based on the ratio and the natural logarithm of the ratio.
Here, the sixth data amount, the seventh data amount, and the eighth characteristic value may be calculated in the same manner as the third data amount, the fourth data amount, and the second characteristic value.
In some embodiments, the method further comprises:
constructing a weighted directed graph based on the association party which is indicated by the second data receiving and sending condition and has a data receiving and sending relation with the sender; one node in the weighted directed graph characterizes one of the senders or one of the associated parties;
generating a ninth feature value set based on the node information and/or the weight of the edge in the weighted directed graph; the edge represents a data receiving and sending relation between two nodes; the weight represents an eighth data volume sent by the starting node connected with the edge to the pointing node connected with the edge in the preset time period;
the determining the category of the sender based on the seventh feature value, the eighth feature value and a target classification model includes:
and inputting the seventh characteristic value, the eighth characteristic value and the ninth characteristic value set into a target classification model, and determining the category of the sender.
In some embodiments, the generating a ninth set of feature values based on the weights of the node information and/or edges in the weighted directed graph includes:
calculating a fourth characteristic value based on a third number of edges characterizing data sent by the sender and a fourth number of edges characterizing data received by the sender;
calculating a fifth characteristic value based on the weight of the edge representing the data sent by the sender;
calculating a sixth characteristic value based on the weight of an edge connected with a correspondent node of a related party belonging to the same IP address as the sender;
generating a ninth set of eigenvalues based on the fourth, fifth and sixth eigenvalues.
In some embodiments, the calculating a fifth eigenvalue based on the weight characterizing the edge of the data sent by the sender includes:
and calculating the weight average value of all edges representing the data sent by the sender in the weighted directed graph as a fifth characteristic value.
Here, the feature value calculation method in the ninth feature value set may be the same as the feature value calculation method in the third feature value set described above.
As shown in fig. 3, an embodiment of the present invention provides a data processing apparatus, including:
the acquiring unit 10 is configured to acquire a first data transceiving condition of a sender of data to be classified in a preset time period;
a determining unit 20, configured to determine a first characteristic value of the sender based on a first data volume sent by the sender at a first Internet Protocol (IP) address and a second data volume sent by the first IP address, both indicated by the first data transceiving condition, the first IP address being the IP address for sending the data to be classified; and to determine a second characteristic value of the sender based on the distribution situation of the plurality of data sending moments of the sender indicated by the first data transceiving condition;
a classification unit 30, configured to determine a category of the sender based on the first feature value, the second feature value, and a target classification model; wherein the data processing modes for the different types of the senders are different.
In some embodiments, the determining unit 20 is specifically configured to:
determining a time interval between every two adjacent data sending moments based on the distribution condition of the plurality of data sending moments of the sending party indicated by the first data sending and receiving condition;
and determining a second characteristic value of the sender based on the distribution situation of the time intervals.
In some embodiments, the determining unit 20 is specifically configured to:
determining a third data volume sent at the data sending time after each time interval and a fourth data volume sent by the sender in the preset time period based on the distribution condition of the time intervals;
calculating a ratio of the third data amount to the fourth data amount;
and determining a second characteristic value of the sender based on the ratio and the natural logarithm of the ratio.
In some embodiments, the apparatus further comprises:
a construction unit, configured to construct a weighted directed graph based on the associated party indicated by the first data transceiving condition and having a data transceiving relationship with the sender; one node in the weighted directed graph characterizes one of the senders or one of the associated parties;
a generating unit, configured to generate a third feature value set based on node information and/or weights of edges in the weighted directed graph; the edge represents a data receiving and sending relation between two nodes; the weight represents a fifth data volume sent by the starting node connected with the edge to the pointing node connected with the edge in the preset time period;
the classification unit 30 is specifically configured to:
and inputting the first characteristic value, the second characteristic value and the third characteristic value set into a target classification model, and determining the category of the sender.
In some embodiments, the generating unit is specifically configured to:
calculating a fourth characteristic value based on a first number of edges characterizing data sent by the sender and a second number of edges characterizing data received by the sender;
calculating a fifth characteristic value based on the weight of the edge representing the data sent by the sender;
calculating a sixth characteristic value based on the weight of an edge connected with a correspondent node of a related party belonging to the same IP address as the sender;
generating a third set of eigenvalues based on the fourth, fifth and sixth eigenvalues.
In some embodiments, the generating unit is specifically configured to:
and calculating the weight average value of all edges representing the data sent by the sender in the weighted directed graph as a fifth characteristic value.
An embodiment of the present invention provides a model training apparatus, including:
the second acquisition unit is used for acquiring a second data receiving and sending condition of a sender of the sample data in a preset time period;
a second determining unit, configured to determine a seventh characteristic value of the sender based on the data volume sent by the sender on a second IP address and the data volume sent by the second IP address, both indicated by the second data transceiving condition, the second IP address being the IP address for sending the sample data; and to determine an eighth characteristic value of the sender based on the distribution situation of the plurality of data sending moments of the sender indicated by the second data transceiving condition;
the training unit is used for training a preset classification model based on the seventh characteristic value and the eighth characteristic value to obtain a classification value of the preset classification model; determining a training loss value according to the difference between the classification value and a label of a sender of the sample data; stopping training the preset classification model when the training loss value meets a preset condition; and when the training loss value does not meet the preset condition, continuing to train the classification model.
In some embodiments, the second determining unit is specifically configured to:
determining a time interval between every two adjacent data sending moments based on the distribution condition of the plurality of data sending moments of the sending party indicated by the second data sending and receiving condition;
and determining an eighth characteristic value of the sender based on the distribution situation of the time intervals.
One specific example is provided below in connection with any of the embodiments described above:
As shown in fig. 4, taking mail as the data by way of example, in order to accurately detect spam, the embodiment of the present invention provides a spam detection method and system based on multi-feature analysis, using the behavior and time data generated in the process of sending and receiving spam.
Suppose $\{M_n\}$ is a set of mails, $\{S_i\}$ are the senders of these mails, and $\{L_i\}$ are the labels of these senders ($i = 1, 2, \ldots, N$), which take the values 1 and 0 to indicate spammers and normal users, respectively. Each mail sender $S_i$ corresponds to a K-dimensional feature vector describing its characteristics. If $S \in \mathbb{R}^K$ represents the set of mail senders, a mapping $f: S \to \{0, 1\}$ is required that marks each mail sender accurately and uniquely. If a mail is sent by a spammer, the mail is spam; if a mail is sent by a normal user, the mail is a normal mail.
S1, extracting characteristics based on behaviors
(1) A weighted directed graph G(V, E) is constructed to represent the entities and relationships in the mail records, where V is the set of nodes and E is the set of edges. $v_i$ and $v_j$ are any two nodes in the graph, representing two mail addresses, and $e_{ij}$ represents an edge in the graph pointing from $v_i$ to $v_j$. If mail address A sends a mail to mail address B, an edge pointing from A to B exists in the mail network, and the number of all mails sent from A to B is the weight of that edge.
(2) Occupancy rate of the mailbox on the IP address: $F_{m\_IP} = out_m / out_{IP}$, where $out_m = |V_{ik}|$ and $out_{IP} = \sum_{i} |V_{ik}|$, the sum running over all nodes i that send on IP address k. These represent the numbers of mails sent by the mail address and by the IP address, respectively, and $|V_{ik}|$ is the number of mails that node i sends on IP address k. For spammers who continually change mailbox addresses, the value of $F_{m\_IP}$ tends toward 0; for normal mail addresses, the values of $F_{m\_IP}$ are distributed relatively uniformly.
(3) Other behavioral characteristics:
A. Out-degree: denoted $|V_i|$, the number of edges leaving node i.
B. First moment of the outgoing edge weights: denoted $\frac{1}{m}\sum_{j=1}^{m} weight_j$, the first moment of the weights of the edges leaving node i, where $weight_j$ is the weight of the j-th outgoing edge and m is the number of outgoing edges.
C. In-degree: denoted $|\{e_{ji}\}|$, the number of edges entering node i.
D. Reply ratio: denoted $|\{e_{ji}\}|/|V_i|$, the ratio of reply edges to all outgoing edges.
E. Total outgoing weight of the IP address: given by the expression in Figure BDA0003442189430000173, the sum of the edge weights of the n nodes located at the same IP address k.
F. Occupancy rate of the mailbox on the IP address: given by the expression in Figure BDA0003442189430000174, the out-degree ratio of the mail address to the IP address where it is located.
Here, the parameters in the steps (2) and (3) together constitute a behavior feature.
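Step S1 can be illustrated with a minimal Python sketch that builds the edge weights of the mail graph from mail records and computes the occupancy rate $F_{m\_IP}$. The record layout (sender address, recipient address, sender IP), the function names, and the example addresses are assumptions made for the illustration and do not come from the original text.

```python
from collections import Counter


def build_mail_graph(records):
    """records: iterable of (sender_address, recipient_address, sender_ip)."""
    edges = Counter()        # (src, dst) -> number of mails, i.e. the edge weight
    sent_on_ip = Counter()   # (src, ip) -> number of mails src sent from that IP
    for src, dst, ip in records:
        edges[(src, dst)] += 1
        sent_on_ip[(src, ip)] += 1
    return edges, sent_on_ip


def occupancy_rate(sent_on_ip, address, ip):
    """F_m_IP: mails sent by `address` on `ip` over all mails sent on `ip`."""
    out_m = sent_on_ip[(address, ip)]
    out_ip = sum(n for (_addr, k), n in sent_on_ip.items() if k == ip)
    return out_m / out_ip if out_ip else 0.0


records = [
    ("a@x.test", "b@y.test", "192.0.2.1"),
    ("a@x.test", "b@y.test", "192.0.2.1"),
    ("a@x.test", "c@y.test", "192.0.2.1"),
    ("d@x.test", "c@y.test", "192.0.2.1"),
    ("b@y.test", "a@x.test", "198.51.100.7"),
]
edges, sent_on_ip = build_mail_graph(records)
rate_a = occupancy_rate(sent_on_ip, "a@x.test", "192.0.2.1")
```

In this toy data set, a@x.test sends three of the four mails that leave 192.0.2.1, so its occupancy rate on that address is high; an address that sends only a small share of the mails leaving an IP address would have a rate close to 0.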
S2, extracting time characteristics based on mail sending intervals
(1) Draw a distribution histogram of the time intervals between sent mails. Spammers aim to deliver as much spam as possible to different people, so they need to send spam to different email addresses continuously. Most of a spammer's mails are therefore sent at short time intervals: the distribution has large values within a small range of time intervals and small values elsewhere, whereas the intervals between adjacent mails of normal users are scattered randomly along the abscissa. Let the mails sent by mail address i over a period of time be, in order, $m_{i,j}$ ($j = 1, 2, \ldots, n$), with corresponding sending times $t_{i,j}$. The time interval between adjacent mails is $d_{i,j} = t_{i,j+1} - t_{i,j}$, and the distribution histogram of the time intervals is obtained from these sending intervals.
(2) Calculate the time feature $F_t$ of the mail sending intervals from the histogram. $p_D$ denotes the proportion of mails whose sending time interval is D among all sent mails. By the definition of entropy, the time feature based on the distribution $p_D$ of the mail sending intervals can be expressed as $F_t = -\sum p_D \ln(p_D)$. For spammers, whose intervals are unevenly distributed, the value of $F_t$ is relatively small, whereas for normal users the value of $F_t$ is relatively large.
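The entropy-based time feature $F_t = -\sum p_D \ln(p_D)$ can be sketched in Python as follows. The histogram bin width `bin_width`, the function name `interval_entropy`, and the choice to normalize by the number of intervals are assumptions of this sketch; the original text does not specify how the histogram is binned.

```python
import math
from collections import Counter


def interval_entropy(send_times, bin_width=60.0):
    """Entropy of the histogram of intervals between adjacent sending times."""
    times = sorted(send_times)
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    if not intervals:
        return 0.0
    # Histogram: count how many intervals fall into each bin of width bin_width.
    bins = Counter(int(d // bin_width) for d in intervals)
    total = len(intervals)
    return sum(-(c / total) * math.log(c / total) for c in bins.values())


burst = interval_entropy([0, 5, 10, 15, 20])        # all intervals in one bin
spread = interval_entropy([0, 30, 150, 400, 1000])  # intervals in four bins
```

A sender whose intervals all fall into a single bin gets an entropy of zero, while a sender whose intervals are spread over several bins gets a larger value, which matches the qualitative behavior described for spammers and normal users.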
S3 mail classification using SVM
An SVM is a machine learning algorithm with good generalization capability. It uses M labeled training samples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_M, y_M)\}$, where $x_i$ is the n-dimensional feature vector of the i-th sample and $y_i$ is the label of the i-th sample. For the binary classification problem, $y_i$ takes the values -1 and +1 to represent the two classes.
Therefore, the behavior features and time features of the training samples are first extracted and then used, together with the corresponding labels, to train the SVM and obtain a classification discrimination model. Finally, the discrimination model is used to classify spammers and normal users, so as to detect spam. In the feature space, the binary mail classification problem is linearly inseparable, so a support vector machine (SVM) based on a Gaussian kernel function is adopted for training and classification.
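For reference, the standard textbook form of a Gaussian-kernel SVM is given below. The symbols $\gamma$, $\alpha_i$, and $b$ are the usual SVM notation and do not appear in the original text; this is the general formulation, not a statement of the specific parameters used in the embodiment.

```latex
K(x_i, x_j) = \exp\left(-\gamma \,\lVert x_i - x_j \rVert^2\right), \qquad \gamma > 0

f(x) = \operatorname{sign}\left(\sum_{i=1}^{M} \alpha_i \, y_i \, K(x_i, x) + b\right)
```

Here $x$ is the feature vector of the sender to be classified, $\alpha_i \ge 0$ and $b$ are learned from the M labeled training samples, and the sign of $f(x)$ selects one of the two classes $-1$ and $+1$. The kernel lets the classifier draw a non-linear boundary in the original feature space, which is why it suits a linearly inseparable problem.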
An embodiment of the present invention further provides an electronic device, where the electronic device includes: a processor and a memory for storing a computer program capable of running on the processor, the computer program when executed by the processor performing the steps of one or more of the methods described above.
An embodiment of the present invention further provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and after being executed by a processor, the computer-executable instructions can implement the method according to one or more of the foregoing technical solutions.
The computer storage media provided by the present embodiments may be non-transitory storage media.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may be separately used as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
In some cases, any two of the above technical features may be combined into a new method solution without conflict.
In some cases, any two of the above technical features may be combined into a new device solution without conflict.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media capable of storing program codes, such as a removable Memory device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, and an optical disk.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A method of data processing, the method comprising:
acquiring a first data transceiving condition of a sender of data to be classified in a preset time period;
determining a first characteristic value of the sender based on a first data volume sent by the sender at a first Internet Protocol (IP) address and a second data volume sent by the first IP address, wherein the first data volume is indicated by the first data transceiving condition; the first IP address is an IP address for sending the data to be classified;
determining a second characteristic value of the sender based on the distribution situation of the plurality of data sending moments of the sender indicated by the first data sending and receiving situation;
determining a category of the sender based on the first feature value, the second feature value and a target classification model; wherein the data processing modes for the different types of the senders are different.
2. The method according to claim 1, wherein the determining a second characteristic value of the sender based on the distribution of the plurality of data transmission time instants of the sender indicated by the first data transceiving condition comprises:
determining a time interval between every two adjacent data sending moments based on the distribution condition of the plurality of data sending moments of the sending party indicated by the first data sending and receiving condition;
and determining a second characteristic value of the sender based on the distribution situation of the time intervals.
3. The method according to claim 2, wherein the determining a second eigenvalue of the sender based on the distribution of the time intervals comprises:
determining a third data volume sent at the data sending time after each time interval and a fourth data volume sent by the sender in the preset time period based on the distribution condition of the time intervals;
calculating a ratio of the third data amount to the fourth data amount;
and determining a second characteristic value of the sender based on the ratio and the natural logarithm of the ratio.
4. The method of claim 1, further comprising:
constructing a weighted directed graph based on the association party which is indicated by the first data receiving and sending condition and has a data receiving and sending relation with the sender; one node in the weighted directed graph characterizes one of the senders or one of the associated parties;
generating a third feature value set based on the node information and/or the weight of the edge in the weighted directed graph; the edge represents a data receiving and sending relation between two nodes; the weight represents a fifth data volume sent by the starting node connected with the edge to the pointing node connected with the edge in the preset time period;
the determining the category of the sender based on the first feature value, the second feature value and the target classification model includes:
and inputting the first characteristic value, the second characteristic value and the third characteristic value set into a target classification model, and determining the category of the sender.
5. The method according to claim 4, wherein generating a third set of eigenvalues based on weights of node information and/or edges in the weighted directed graph comprises:
calculating a fourth characteristic value based on a first number of edges characterizing data sent by the sender and a second number of edges characterizing data received by the sender;
calculating a fifth characteristic value based on the weight of the edge representing the data sent by the sender;
calculating a sixth characteristic value based on the weight of an edge connected with a correspondent node of a related party belonging to the same IP address as the sender;
generating a third set of eigenvalues based on the fourth, fifth and sixth eigenvalues.
6. The method of claim 5, wherein the calculating a fifth eigenvalue based on the weight characterizing the edge of the data sent by the sender comprises:
and calculating the weight average value of all edges representing the data sent by the sender in the weighted directed graph as a fifth characteristic value.
7. A method of model training, the method comprising:
acquiring a second data receiving and sending condition of a sender of the sample data in a preset time period;
determining a seventh characteristic value of the sender based on the data volume sent by the sender on the second IP address and the data volume sent by the second IP address indicated by the second data transceiving condition; the second IP address is an IP address for sending the sample data;
determining an eighth characteristic value of the sender based on the distribution situation of the plurality of data sending moments of the sender indicated by the second data sending and receiving situation;
training a preset classification model based on the seventh characteristic value and the eighth characteristic value to obtain a classification value of the preset classification model;
determining a training loss value according to the difference between the classification value and a label of a sender of the sample data;
stopping training the preset classification model when the training loss value meets a preset condition;
and when the training loss value does not meet the preset condition, continuing to train the classification model.
8. The method according to claim 7, wherein the determining an eighth eigenvalue of the sender based on the distribution of the plurality of data transmission time instants of the sender indicated by the second data transmission/reception condition comprises:
determining a time interval between every two adjacent data sending moments based on the distribution condition of the plurality of data sending moments of the sending party indicated by the second data sending and receiving condition;
and determining an eighth characteristic value of the sender based on the distribution situation of the time intervals.
9. An electronic device, characterized in that the electronic device comprises: a processor and a memory for storing a computer program capable of running on the processor; wherein,
the processor, when executing the computer program, performs the steps of the method of any of claims 1 to 8.
10. A computer-readable storage medium having stored thereon computer-executable instructions; the computer-executable instructions, when executed by a processor, are capable of implementing the method of any one of claims 1 to 8.
CN202111681609.6A 2021-12-29 2021-12-29 Data processing method, model training method, electronic device, and storage medium Pending CN114389872A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111681609.6A CN114389872A (en) 2021-12-29 2021-12-29 Data processing method, model training method, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN114389872A (en) 2022-04-22

Family

ID=81199764

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111681609.6A Pending CN114389872A (en) 2021-12-29 2021-12-29 Data processing method, model training method, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN114389872A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102356611A (en) * 2009-03-16 2012-02-15 雅马哈株式会社 Relay device, setting update method, and program
CN102404341A (en) * 2011-12-22 2012-04-04 中标软件有限公司 Method and device for monitoring user behavior of e-mail
US20120131107A1 (en) * 2010-11-18 2012-05-24 Microsoft Corporation Email Filtering Using Relationship and Reputation Data
CN102857404A (en) * 2011-06-30 2013-01-02 厦门三五互联科技股份有限公司 Device and method for spam detection based on email fingerprint features
CN103078752A (en) * 2012-12-27 2013-05-01 华为技术有限公司 Method, device and equipment for detecting e-mail attack
CN104506356A (en) * 2014-12-24 2015-04-08 网易(杭州)网络有限公司 Method and device for determining credibility of IP (Internet protocol) address
CN108880990A (en) * 2018-06-14 2018-11-23 深信服科技股份有限公司 Method, system, device and readable storage medium for detecting outgoing spam

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张洪; 段海新; 吴建平: "Anti-spam reputation system based on IP address clustering", Journal of Tsinghua University (Science and Technology), no. 10, pages 1 - 5 *

Similar Documents

Publication Publication Date Title
JP4742618B2 (en) Information processing system, program, and information processing method
US8224905B2 (en) Spam filtration utilizing sender activity data
JP4742619B2 (en) Information processing system, program, and information processing method
KR100992220B1 (en) Spam Detector with Challenges
US8335383B1 (en) Image filtering systems and methods
Renuka et al. Spam classification based on supervised learning using machine learning techniques
US8370930B2 (en) Detecting spam from metafeatures of an email message
US20070038705A1 (en) Trees of classifiers for detecting email spam
CN109889436B (en) Method for discovering spammer in social network
Govil et al. A machine learning based spam detection mechanism
CN111835622B (en) Information interception method, device, computer equipment and storage medium
Qaroush et al. Identifying spam e-mail based-on statistical header features and sender behavior
Bouguessa An unsupervised approach for identifying spammers in social networks
Bhat et al. Classification of email using BeaKS: Behavior and keyword stemming
Sharma et al. E-Mail Spam Detection Using SVM and RBF.
CN111221970A (en) Mail classification method and device based on behavior structure and semantic content joint analysis
Iyengar et al. Integrated spam detection for multilingual emails
CN103198396A (en) Mail classification method based on social network behavior characteristics
Anitha et al. Email spam classification using neighbor probability based Naïve Bayes algorithm
CN114389872A (en) Data processing method, model training method, electronic device, and storage medium
Sumithra et al. Probability-based Naïve Bayes Algorithm for Email Spam Classification
CN110557352A (en) Method, device and equipment for detecting mass-sending junk mails
CN114091586A (en) Account identification model determining method, device, equipment and medium
Abhinav et al. Spam Mail Detection using Machine Learning
Hershkop et al. Identifying spam without peeking at the contents

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned (effective date of abandoning: 20240517)