CN112116018A - Sample classification method, apparatus, computer device, medium, and program product - Google Patents

Sample classification method, apparatus, computer device, medium, and program product Download PDF

Info

Publication number
CN112116018A
Authority
CN
China
Prior art keywords
sample
malicious
evaluation value
prediction result
unknown
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011028622.7A
Other languages
Chinese (zh)
Inventor
刘爽
白皓文
白敏
汪列军
黄朝文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qianxin Technology Group Co Ltd
Secworld Information Technology Beijing Co Ltd
Original Assignee
Qianxin Technology Group Co Ltd
Secworld Information Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qianxin Technology Group Co Ltd, Secworld Information Technology Beijing Co Ltd filed Critical Qianxin Technology Group Co Ltd
Priority to CN202011028622.7A priority Critical patent/CN112116018A/en
Publication of CN112116018A publication Critical patent/CN112116018A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24143 Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Abstract

The present disclosure provides a sample classification method, comprising: obtaining an unknown sample; determining a feature vector of the unknown sample; processing the feature vector using a pre-constructed classification model to obtain, for the unknown sample, a class prediction result and a neighboring sample prediction result, wherein the class prediction result comprises a malicious sample class label; and determining an evaluation value of the malicious sample class label based on the unknown sample and the neighboring sample prediction result. The present disclosure also provides a sample classification apparatus, a computer device, a medium, and a program product.

Description

Sample classification method, apparatus, computer device, medium, and program product
Technical Field
The present disclosure relates to the field of network security technologies, and in particular, to a sample classification method, apparatus, computer device, medium, and program product.
Background
The rapid development of the internet has brought people convenience, but also considerable trouble, and the proliferation of malicious samples (such as malicious programs, malicious files, and malicious information) is one such problem. With the continuous advance of computer technology, identifying and classifying malicious samples by machine learning has gradually emerged in recent years. However, owing to various factors (such as the accuracy of feature extraction and how well the chosen machine learning algorithm fits the task), the classification results for malicious samples are often not accurate enough, and the probability that the classification model generates false positives is high. Methods for suppressing false positives are therefore needed to improve the classification accuracy for malicious samples.
Disclosure of Invention
The present disclosure provides a sample classification method, apparatus, computer device, medium, and program product.
One aspect of the present disclosure provides a sample classification method, including: obtaining an unknown sample; determining a feature vector of the unknown sample; processing the feature vector using a pre-constructed classification model to obtain, for the unknown sample, a class prediction result and a neighboring sample prediction result, wherein the class prediction result comprises a malicious sample class label; and determining an evaluation value of the malicious sample class label based on the unknown sample and the neighboring sample prediction result.
Optionally, the neighboring sample prediction result comprises a nearest neighbor sample of the unknown sample, i.e., a known sample in the training set of the classification model whose distance to the unknown sample is less than a first threshold. Determining the evaluation value of the malicious sample class label based on the unknown sample and the neighboring sample prediction result comprises: acquiring content information of at least one item in the unknown sample; acquiring content information of the at least one item in the nearest neighbor sample; for each of the at least one item, calculating the similarity between the content information of that item in the unknown sample and in the nearest neighbor sample; determining a first evaluation value of the malicious sample class label based on the similarities of the at least one item; and determining the evaluation value of the malicious sample class label based on the first evaluation value.
Optionally, determining the first evaluation value of the malicious sample class label based on the similarities of the at least one item includes: performing a weighted summation of the similarities of the at least one item using preset weights for the at least one item, so as to calculate the first evaluation value of the malicious sample class label.
Optionally, the at least one item includes at least one of: program database (PDB) files, domain names, URL addresses, and IP addresses.
Optionally, the class prediction result further includes a confidence for the malicious sample class label, and the neighboring sample prediction result further includes the distance between the nearest neighbor sample and the unknown sample. Determining the evaluation value of the malicious sample class label based on the first evaluation value includes: taking the difference between the confidence and the distance as a second evaluation value of the malicious sample class label; and taking the sum of the first evaluation value and the second evaluation value as the evaluation value of the malicious sample class label.
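The arithmetic just described (a second evaluation value as confidence minus distance, and the overall evaluation value as the sum of the two partial values) can be sketched in a few lines of Python. This is an illustrative sketch, not the patent's implementation; it assumes the confidence and the distance have already been normalised to comparable scales, which the disclosure does not specify.

```python
def second_evaluation_value(confidence: float, distance: float) -> float:
    """Second evaluation value: confidence minus the nearest-neighbor distance."""
    return confidence - distance

def evaluation_value(first_eval: float, confidence: float, distance: float) -> float:
    """Overall evaluation value: sum of the first and second evaluation values."""
    return first_eval + second_evaluation_value(confidence, distance)
```

For instance, a confident prediction (0.9) whose nearest neighbor is close (distance 0.2) contributes a second evaluation value of 0.7.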
Optionally, the method further includes: when the evaluation value of the malicious sample class label is less than a third threshold, determining that the malicious sample class label is incorrect; and when the evaluation value of the malicious sample class label is greater than or equal to the third threshold, determining that the malicious sample class label is correct.
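The comparison against the third threshold can be sketched as follows; the function name and the example threshold below are illustrative assumptions, since the disclosure does not fix a concrete threshold value.

```python
def label_is_correct(evaluation_value: float, third_threshold: float) -> bool:
    """Judge a malicious sample class label: correct iff the evaluation value
    reaches the third threshold, incorrect (a false positive) otherwise."""
    return evaluation_value >= third_threshold
```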
Optionally, processing the feature vector using the pre-constructed classification model includes: processing the feature vector based on a random forest algorithm to determine the class prediction result; and processing the feature vector based on a k-nearest neighbor algorithm to determine the neighboring sample prediction result.
Another aspect of the present disclosure provides a sample classification apparatus, including: an acquisition module, a feature determination module, a classification module, and an evaluation module. The acquisition module is used for acquiring an unknown sample. The feature determination module is used for determining a feature vector of the unknown sample. The classification module is used for processing the feature vector using a pre-constructed classification model to obtain, for the unknown sample, a class prediction result and a neighboring sample prediction result, wherein the class prediction result comprises a malicious sample class label. The evaluation module is used for determining an evaluation value of the malicious sample class label based on the unknown sample and the neighboring sample prediction result.
Another aspect of the disclosure provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform a method as in any of the embodiments above.
Another aspect of the disclosure provides a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor. The processor, when executing the computer program, is adapted to carry out the method according to any of the embodiments described above.
Another aspect of the disclosure provides a computer program product comprising computer readable instructions. Wherein the computer readable instructions are adapted to perform a method as described in any of the embodiments above when executed.
According to the technical solution of the embodiments of the present disclosure, a feature vector that characterizes the sample's traits from multiple dimensions is extracted for an unknown sample, and a class prediction result and a neighboring sample prediction result for the unknown sample are obtained after processing the feature vector with a pre-constructed classification model. The accuracy of the malicious sample class predicted by the classification model for the unknown sample is evaluated based on the unknown sample and the neighboring sample prediction result, thereby suppressing false positives in the class prediction results of the classification model and further improving the accuracy of the sample classification results.
Drawings
For a more complete understanding of the present disclosure and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
fig. 1 schematically illustrates an application scenario of a sample classification method and apparatus according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow diagram of a sample classification method according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a flow chart of a sample classification method according to another embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of a sample classification method according to another embodiment of the present disclosure;
FIG. 5 schematically illustrates an example diagram of a sample classification process according to an embodiment of the disclosure;
FIG. 6 schematically illustrates a block diagram of a sample classification device according to an embodiment of the present disclosure; and
FIG. 7 schematically shows a block diagram of a computer device according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B, and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). Where a convention analogous to "at least one of A, B, or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
Some block diagrams and/or flow diagrams are shown in the figures. It will be understood that some blocks of the block diagrams and/or flowchart illustrations, or combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the instructions, which execute via the processor, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks. The techniques of this disclosure may be implemented in hardware and/or software (including firmware, microcode, etc.). In addition, the techniques of this disclosure may take the form of a computer program product on a computer-readable storage medium having instructions stored thereon for use by or in connection with an instruction execution system.
The embodiments of the present disclosure provide a sample classification method and apparatus executed by a computer device. The method may include an acquisition process, a feature determination process, a classification process, and an evaluation process. In the acquisition process, an unknown sample is acquired. In the feature determination process, a feature vector of the unknown sample is determined. In the classification process, the feature vector is processed using a pre-constructed classification model to obtain a class prediction result and a neighboring sample prediction result for the unknown sample, wherein the class prediction result comprises a malicious sample class label. In the evaluation process, an evaluation value of the malicious sample class label is determined based on the unknown sample and the neighboring sample prediction result.
Fig. 1 schematically illustrates an application scenario of the sample classification method and apparatus according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a scenario in which the embodiments of the present disclosure may be applied, to help those skilled in the art understand the technical content of the present disclosure, but this does not mean that the embodiments of the present disclosure cannot be applied to other devices, systems, environments, or scenarios.
As shown in fig. 1, the application scenario may include terminal devices 101, 102, 103, a network 104 and a server/server cluster 105.
The terminal devices 101, 102, 103 may be various electronic devices, such as smart phones, tablets, portable computers, desktop computers, and the like. Various applications may be installed in the terminal apparatuses 101, 102, 103 to perform different functions. The server/server cluster 105 may have a greater computing power than the terminal devices 101, 102, 103. The server/server cluster 105 may provide various background support services for the end devices 101, 102, 103. Network 104 is the medium used to provide communication links between terminal devices 101, 102, 103 and server/server cluster 105.
The sample classification method according to the embodiments of the present disclosure may be performed by the terminal devices 101, 102, 103, and accordingly, the sample classification apparatus according to the embodiments of the present disclosure may be provided in the terminal devices 101, 102, 103. In addition, the sample classification method according to the embodiment of the present disclosure may also be executed by the server/server cluster 105, and accordingly, the sample classification apparatus according to the embodiment of the present disclosure may be disposed in the server/server cluster 105.
It should be understood that the number of end devices, networks, and server/server clusters in fig. 1 is merely illustrative. There may be any number of end devices, networks, and server/server clusters, as desired for implementation.
According to an embodiment of the present disclosure, a sample classification method is provided that introduces false positive suppression into the sample classification process so as to improve classification accuracy for malicious samples. The method is illustrated below with reference to the figures. It should be noted that the sequence numbers of the steps in the following methods merely identify the steps for ease of description and should not be construed as indicating an execution order. The method need not be performed in the exact order shown, unless explicitly stated.
Fig. 2 schematically illustrates a flow diagram of a sample classification method according to an embodiment of the present disclosure, which may be performed by various types of computer devices.
As shown in fig. 2, the method may include operations S210 to S240.
In operation S210, an unknown sample is acquired.
Illustratively, the unknown sample is a sample whose class is to be predicted; it can take various forms, such as a program, a file, or a piece of information.
In operation S220, a feature vector of the unknown sample is determined.
Illustratively, full information extraction (which may be referred to as gene extraction) may be performed on the unknown sample. For example, the unknown sample may be subjected to static analysis and/or dynamic analysis to obtain multiple items of information, and the feature information of the unknown sample may then be determined from those items. Taking an unknown sample program as an example: without running the program, its source code or binary machine code is statically scanned and analyzed, for example by parsing its PE (Portable Executable) structure, to obtain multiple items of static information; and, when the program is executed, its actual execution path and running state are monitored and tracked, so as to extract information related to dynamic behavior, such as data flow.
According to embodiments of the present disclosure, the full amount of information extracted from an unknown sample may more comprehensively characterize the traits of the unknown sample from multiple dimensions, which information constitutes a genetic map of the unknown sample. Illustratively, the full amount information may include numerical class information and fixed class information, wherein the numerical class information may include, for example: array size, dictionary element number, length of predetermined character string, word frequency statistical result of predetermined item, word length statistical result, etc. The fixed class information may include, for example: the file sub-stream maximum value, the file sub-stream minimum value, the PE sub-stream number, the PDF sub-stream number, the section information, the stream type, the compiler type, the length information and the content information of the PDB (Program Database) file, the PNG type picture number, the JPG type picture number, the OLE (Object Linking and Embedding) Object number, the API (Application Programming Interface) call number, the registry operation number, the release file number, the IP address, the domain name, the URL address, the mutex, the document author information, and the like.
Illustratively, after the full information of the unknown sample is extracted, interfering static information, such as timestamps, the content information of the PDB file, and some non-PE information, can be excluded to obtain a filtered information set. Feature values are then converted and mapped on the basis of this information set to obtain a number of numerical features and fixed features, which are combined to generate the 1×n-dimensional feature vector used by the machine learning algorithm, where n is an integer greater than 1. Compared with traditional methods that extract features only from bags of words, N-grams (N-gram models), and the like, this way of determining the feature vector is more universal and can accommodate different types of sample files.
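As a rough illustration of the conversion-and-mapping step described above, the sketch below concatenates numerical features with integer-mapped fixed features into one flat vector. The feature names and the integer-id mapping scheme are assumptions for illustration only; the disclosure does not specify a concrete mapping.

```python
def build_feature_vector(numeric_features, fixed_features, fixed_vocab):
    """Concatenate numeric feature values with integer-mapped fixed features
    into one flat 1-by-n vector."""
    vec = [float(v) for v in numeric_features.values()]
    for key, value in fixed_features.items():
        # Map each fixed (string-valued) feature to an integer id;
        # unseen values map to 0.
        vec.append(float(fixed_vocab.get(key, {}).get(value, 0)))
    return vec
```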
Having illustrated an exemplary embodiment for determining a feature vector for an unknown sample, reference is now continued with respect to FIG. 2.
In operation S230, the feature vector is processed using a pre-constructed classification model to obtain a class predictor and a neighboring sample predictor for the unknown sample.
For classification of malicious samples, a classification model is usually constructed by training in advance. For example, a large number of known samples may be selected, the feature vector of each known sample determined, and the malicious sample class to which each known sample belongs labeled, including, for example, the following labels: APT (Advanced Persistent Threat), CVE (Common Vulnerabilities and Exposures), malicious family code, and the like. The feature vectors and labels of the known samples form a training set, and the initial classification model is trained with this training set to obtain the target classification model, i.e., the pre-constructed classification model.
According to the embodiments of the present disclosure, the feature vector of each known sample may be determined in the same manner as the feature vector of the unknown sample described above. When training the initial classification model with the training set, multiple machine learning algorithms can be combined, supervised training performed so as to draw on the strengths of each algorithm, and the parameters of the classification model iteratively optimized. Training is complete upon convergence, yielding the target classification model. In operation S230, when the target classification model trained on the training set is applied for classification, it can output a class prediction result and a neighboring sample prediction result for an unknown sample. Illustratively, the class prediction result includes a malicious sample class label characterizing the predicted malicious sample class of the unknown sample, and the neighboring sample prediction result characterizes the known samples in the training set that are associated with the unknown sample.
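The training-set construction described above amounts to pairing each known sample's feature vector with its malicious sample class label; the sketch below merely splits such pairs into the parallel X/y lists a supervised learner would consume. The dictionary layout and label strings are illustrative assumptions.

```python
def build_training_set(known_samples):
    """Split labelled known samples into parallel feature-vector / label lists."""
    X = [s["vector"] for s in known_samples]
    y = [s["label"] for s in known_samples]
    return X, y
```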
Having illustrated an exemplary embodiment of constructing the classification model, reference is now continued with respect to FIG. 2.
In operation S240, an evaluation value of a malicious sample class label is determined based on the unknown sample and the neighboring sample prediction result.
Illustratively, the evaluation value of the malicious sample class label characterizes the accuracy of that label. In operation S240, the accuracy of the class prediction result output by the classification model for the unknown sample is assessed based on the unknown sample and the neighboring sample prediction result output by the classification model, so that it can be determined whether the classification model has produced a false positive.
As can be appreciated by those skilled in the art, the sample classification method of the embodiments of the present disclosure extracts, for an unknown sample, a feature vector that characterizes the sample's traits from multiple dimensions, and processes that feature vector with a pre-constructed classification model to obtain a class prediction result and a neighboring sample prediction result for the unknown sample. The accuracy of the malicious sample class predicted by the classification model for the unknown sample is then evaluated based on the unknown sample and the neighboring sample prediction result, thereby suppressing false positives in the class prediction results of the classification model and further improving the accuracy of the sample classification results.
According to an embodiment of the present disclosure, when training the initial classification model with the training set as above, a Random Forest algorithm and a k-nearest neighbor (kNN/IBk) algorithm may, for example, be combined to generate the pre-constructed classification model. The classification model can output a class prediction result for an unknown sample, and also supports outputting, for that sample, the k known samples in the training set that are closest to it, which may be called the nearest neighbor samples of the unknown sample, where k is a positive integer. On this basis, processing the feature vector with the pre-constructed classification model may include: processing the feature vector of the unknown sample based on the random forest algorithm to determine the class prediction result; and processing the feature vector of the unknown sample based on the k-nearest neighbor algorithm to determine the neighboring sample prediction result. Combining the random forest and IBk algorithms in this way enables malicious sample detection with high precision and accuracy; performing false positive suppression on top of the model's recognition results then yields an even more accurate malicious sample detection result.
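A minimal sketch of the two-branch prediction described above: the class-prediction branch is stubbed (standing in for the trained random forest, whose internals the disclosure does not detail), while the neighboring-sample branch finds the k nearest known samples by Euclidean distance. All function names and sample data are illustrative assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train_X, query, k):
    """(distance, index) pairs for the k nearest training vectors, nearest first."""
    return sorted((euclidean(v, query), i) for i, v in enumerate(train_X))[:k]

def classify(query, train_X, train_y, class_stub, k=3):
    """Two-branch prediction: a stubbed class predictor plus a kNN branch."""
    label, confidence = class_stub(query)  # stands in for the random forest
    neighbors = [(d, i, train_y[i]) for d, i in k_nearest(train_X, query, k)]
    return {"label": label, "confidence": confidence, "neighbors": neighbors}
```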
Illustratively, the neighboring sample prediction result for the unknown sample may include the nearest neighbor samples of the unknown sample. A nearest neighbor sample may be, for example, a known sample in the training set of the classification model whose distance to the unknown sample is less than a first threshold. For example, while processing the feature vector of the unknown sample, the classification model calculates the distance (e.g., Euclidean distance, Mahalanobis distance, etc.) between the feature vector of each known sample in the training set and that of the unknown sample, obtaining a distance vector, and the nearest neighbor samples of the unknown sample can then be determined from this distance vector.
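The nearest-neighbor selection just described (a distance vector, then a first-threshold filter) might look like the sketch below. Euclidean distance is used here, and the threshold in the test is an illustrative value; the disclosure leaves both the metric and the threshold open.

```python
import math

def nearest_neighbors(train_X, query, first_threshold):
    """(index, distance) for every known sample closer to the query than the
    first threshold, sorted nearest first."""
    hits = []
    for i, v in enumerate(train_X):
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(v, query)))
        if d < first_threshold:
            hits.append((i, d))
    return sorted(hits, key=lambda t: t[1])
```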
Fig. 3 schematically shows a flowchart of a sample classification method according to another embodiment of the present disclosure, for exemplarily illustrating a process of determining an evaluation value of a malicious sample class label of an unknown sample.
As shown in fig. 3, the above process of determining the evaluation value of the malicious sample class label based on the unknown sample and the neighboring sample prediction result may include operations S341 to S344.
In operation S341, content information of at least one item in the unknown sample is acquired.
In operation S342, content information of the at least one item in the nearest neighbor sample is acquired.
For example, since information on multiple items of the unknown sample has already been extracted in the process of determining its feature vector, operation S341 may obtain the content information of the at least one item from that extracted information. Similarly, information on multiple items of each known sample is extracted when determining its feature vector, so operation S342 may likewise obtain the content information of the at least one item from the extracted information. In other embodiments, the unknown sample and the nearest neighbor sample may instead be scanned and analyzed in operations S341 to S342 to obtain the content information of the at least one item. An "item" here may refer to a specific file, a specific type of attribute, and the like in the unknown sample; for example, the at least one item may include at least one of: a PDB file, a domain name, a URL address, an IP address, etc. The information for each item may include content information and length information; this embodiment needs the content information of each item.
In operation S343, for each item of the at least one item, a similarity between the content information of the item in the unknown sample and the content information of the item in the nearest neighbor sample is calculated.
For example, suppose the at least one item includes: a PDB file, a domain name, a URL address, and an IP address. A similarity x1 between the PDB string in the unknown sample and the PDB string in the nearest neighbor sample is calculated; a similarity x2 between the domain name string in the unknown sample and the domain name string in the nearest neighbor sample is calculated; a similarity x3 between the URL address string in the unknown sample and the URL address string in the nearest neighbor sample is calculated; and a similarity x4 between the IP address string in the unknown sample and the IP address string in the nearest neighbor sample is calculated. By comparing the similarity of the content information of each item between the unknown sample and the nearest neighbor sample predicted by the classification model, the association between the two can be analyzed from another dimension, yielding a measure of how accurate the predicted nearest neighbor sample is; the accuracy of the nearest neighbor sample directly affects the accuracy of the class prediction result. Illustratively, a string similarity algorithm such as edit distance or Jaro distance may be used to calculate the similarity between the content information of the same item in the unknown sample and the nearest neighbor sample.
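A per-item string similarity of the kind described for operation S343 can be sketched as follows. The text mentions edit distance and Jaro distance; the sketch below normalizes a classic Levenshtein edit distance into a [0, 1] similarity, and the sample strings are invented for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance into a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# Hypothetical item strings extracted from the two samples.
unknown  = {"pdb": r"c:\build\agent.pdb", "domain": "c2.example.com",
            "url": "http://c2.example.com/gate.php", "ip": "10.1.2.3"}
neighbor = {"pdb": r"c:\build\agent2.pdb", "domain": "c2.example.org",
            "url": "http://c2.example.org/gate.php", "ip": "10.1.2.9"}
x = {item: similarity(unknown[item], neighbor[item])
     for item in ("pdb", "domain", "url", "ip")}  # x1 .. x4
```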
In operation S344, a first evaluation value of the malicious sample class label is determined based on the respective similarities of the at least one item.
Continuing the example above, a first evaluation value of the malicious sample class label of the unknown sample is determined based on the similarities x1, x2, x3 and x4. According to an embodiment of the present disclosure, this process may include: performing a weighted summation of the similarities of the at least one item using preset weights of the at least one item, so as to calculate the first evaluation value of the malicious sample class label. For example, if the weights of the PDB file, the domain name, the URL address and the IP address are α, β, γ and δ respectively, the first evaluation value S1 of the malicious sample class label of the unknown sample can be calculated according to formula (1):
S1 = α×x1 + β×x2 + γ×x3 + δ×x4
Formula (1)
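Formula (1) is a plain weighted sum and can be sketched as follows; the weight values below are illustrative assumptions, not values from the disclosure.

```python
def first_evaluation(similarities, weights):
    """Formula (1): S1 = alpha*x1 + beta*x2 + gamma*x3 + delta*x4,
    a weighted sum of the per-item similarities with preset weights."""
    return sum(w * x for w, x in zip(weights, similarities))

# Illustrative preset weights for PDB file, domain name, URL, IP address.
alpha, beta, gamma, delta = 0.4, 0.3, 0.2, 0.1
s1 = first_evaluation([0.9, 0.8, 0.7, 1.0], [alpha, beta, gamma, delta])
```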
After the first evaluation value is calculated, it can be directly used as the evaluation value of the malicious sample class label output by the classification model for the unknown sample. Alternatively, further processing may be performed on the basis of the first evaluation value to obtain the evaluation value of the malicious sample class label.
The following describes an exemplary procedure of continuing the processing based on the first evaluation value to obtain an evaluation value of the malicious sample class tag.
Optionally, the class prediction result further includes a confidence C for the malicious sample class label, which is proportional to the accuracy of the output result of the classification model. The neighboring sample prediction result further includes a distance L between the nearest neighbor sample and the unknown sample, which is inversely proportional to the accuracy of the output result of the classification model.
With continued reference to fig. 3, the above process of determining an evaluation value of a malicious sample class label based on the unknown sample and the prediction result of the neighboring sample may further include operations S345 to S346.
In operation S345, the difference between the confidence C in the class prediction result for the unknown sample and the above distance L is taken as the second evaluation value of the malicious sample class label. Exemplarily, the second evaluation value S2 can be calculated according to formula (2).
S2=C-L
Formula (2)
In operation S346, the sum of the first evaluation value and the second evaluation value is taken as the evaluation value of the malicious sample class label. For example, the evaluation value S may be calculated according to formula (3).
S=S1+S2
Formula (3)
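Formulas (2) and (3) compose directly; a minimal sketch with invented numeric values (the confidence, distance, and first evaluation value below are hypothetical):

```python
def evaluation_value(s1, confidence, distance):
    """Formula (2): S2 = C - L; formula (3): S = S1 + S2."""
    s2 = confidence - distance
    return s1 + s2

# Hypothetical values: first evaluation value 0.84, confidence C = 0.95,
# nearest-neighbor distance L = 0.30.
s = evaluation_value(0.84, confidence=0.95, distance=0.30)
```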
In other embodiments, the second evaluation value may also be directly used as the evaluation value of the malicious sample class label of the unknown sample.
Fig. 4 schematically shows a flow chart of a sample classification method according to another embodiment of the present disclosure.
As shown in fig. 4, the method may include operations S410 to S470.
In operation S410, an unknown sample is acquired.
In operation S420, a feature vector of an unknown sample is determined.
In operation S430, the feature vectors are processed using a pre-constructed classification model to obtain class predictors and neighbor sample predictors for the unknown sample.
In operation S440, an evaluation value of a malicious sample class label is determined based on the unknown sample and the neighboring sample prediction result.
The above operations S410 to S440 are similar to the above embodiments of the operations S210 to S240, and are not described herein again.
In operation S450, it is determined whether the evaluation value is less than a third threshold value. If so, operation S460 is performed. If not, operation S470 is performed.
In operation S460, it is determined that the malicious sample class label is incorrect, and the output result of the classification model may be determined to be a false alarm.
In operation S470, it is determined that the malicious sample class label is correct, and the output result of the classification model may be determined to be correct, i.e., not a false alarm.
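The decision of operations S450 to S470 reduces to a single threshold comparison; a minimal sketch follows, in which the third threshold value is an assumption for illustration.

```python
def check_label(evaluation, third_threshold=0.5):
    """Operations S450-S470: an evaluation value below the third
    threshold marks the malicious sample class label as a false alarm."""
    if evaluation < third_threshold:
        return "false alarm"    # S460: label judged incorrect
    return "label correct"      # S470: label judged correct

verdict = check_label(1.49)
```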
Fig. 5 schematically illustrates an example diagram of a sample classification process according to an embodiment of the disclosure.
As shown in fig. 5, an unknown sample 501 is preprocessed to extract a feature vector 502, the feature vector 502 is input to a pre-constructed classification model 503, and the classification model outputs prediction results for the unknown sample 501: a malicious sample class label 504, a nearest neighbor sample 505, a confidence 506, and distance information 507 between the nearest neighbor sample and the unknown sample. On one hand, a first evaluation value 508 may be derived based on the similarity between the content information of at least one item in the nearest neighbor sample 505 and in the unknown sample 501. On the other hand, the difference between the confidence 506 and the distance information 507 is taken as a second evaluation value 509. The processes of calculating the first evaluation value 508 and the second evaluation value 509 are described above and are not repeated here. The first evaluation value 508 and the second evaluation value 509 are added to obtain a final evaluation value 510. Based on the final evaluation value 510, false alarms of the classification model 503 can be identified and fed back, thereby realizing false alarm suppression and further improving the accuracy of the classification result for the unknown sample.
Fig. 6 schematically illustrates a block diagram of a sample classification apparatus according to an embodiment of the present disclosure, which may be provided in various types of computer devices.
As shown in fig. 6, the sample classification apparatus 600 may include: an acquisition module 610, a feature determination module 620, a classification module 630, and an evaluation module 640.
The acquisition module 610 is used to acquire an unknown sample.
The feature determination module 620 is used to determine a feature vector of the unknown sample.
The classification module 630 is configured to process the feature vector using a pre-constructed classification model to obtain a class prediction result for the unknown sample and a neighboring sample prediction result, where the class prediction result includes a malicious sample class label.
The evaluation module 640 is configured to determine an evaluation value of the malicious sample class label based on the unknown sample and the neighboring sample prediction result.
It should be noted that the implementation, solved technical problems, implemented functions, and achieved technical effects of each module/unit/subunit and the like in the apparatus part embodiment are respectively the same as or similar to the implementation, solved technical problems, implemented functions, and achieved technical effects of each corresponding step in the method part embodiment, and are not described herein again.
Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.
Fig. 7 schematically shows a block diagram of a computer device adapted to implement the above described method according to an embodiment of the present disclosure. The computer device shown in fig. 7 is only an example and should not bring any limitation to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 7, computer device 700 includes a processor 710 and a computer-readable storage medium 720. The computer device 700 may perform a method according to an embodiment of the present disclosure.
In particular, processor 710 may comprise, for example, a general purpose microprocessor, an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), and/or the like. The processor 710 may also include on-board memory for caching purposes. Processor 710 may be a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
Computer-readable storage medium 720, for example, may be a non-volatile computer-readable storage medium, specific examples including, but not limited to: magnetic storage devices, such as magnetic tape or Hard Disk Drives (HDDs); optical storage devices, such as compact disks (CD-ROMs); a memory, such as a Random Access Memory (RAM) or a flash memory; and so on.
The computer-readable storage medium 720 may include a computer program 721, which computer program 721 may include code/computer-executable instructions that, when executed by the processor 710, cause the processor 710 to perform a method according to an embodiment of the disclosure, or any variation thereof.
The computer program 721 may be configured with, for example, computer program code comprising computer program modules. For example, in an example embodiment, the code in computer program 721 may include one or more program modules, such as module 721A, module 721B, etc. It should be noted that the division and number of modules are not fixed; those skilled in the art may use suitable program modules or combinations thereof according to the actual situation, so that when these program modules are executed by the processor 710, the processor 710 can perform the method according to the embodiment of the present disclosure or any variation thereof.
According to an embodiment of the present disclosure, at least one of the obtaining module 610, the feature determining module 620, the classifying module 630 and the evaluating module 640 described with reference to fig. 6 may be implemented as a computer program module which, when executed by the processor 710, may implement the method described above.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or combinations of features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or combinations are not expressly recited in the present disclosure. In particular, various combinations and/or combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.
While the disclosure has been shown and described with reference to certain exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims and their equivalents. Accordingly, the scope of the present disclosure should not be limited to the above-described embodiments, but should be defined not only by the appended claims, but also by equivalents thereof.

Claims (11)

1. A method of sample classification, comprising:
obtaining an unknown sample;
determining a feature vector of the unknown sample;
processing the feature vector by utilizing a pre-constructed classification model to obtain a class prediction result and a neighboring sample prediction result aiming at the unknown sample, wherein the class prediction result comprises a malicious sample class label; and
determining an evaluation value of the malicious sample class label based on the unknown sample and the neighboring sample prediction result.
2. The method of claim 1, wherein the neighboring sample prediction result comprises: a nearest neighbor sample of the unknown sample, the nearest neighbor sample comprising: a known sample in the training set of the classification model whose distance from the unknown sample is less than a first threshold;
the determining an evaluation value of the malicious sample class label based on the unknown sample and the neighboring sample prediction results comprises:
acquiring content information of at least one item in the unknown sample;
obtaining content information of at least one item in the nearest neighbor sample;
for each item of the at least one item, calculating a similarity between content information of the item in the unknown sample and content information of the item in the nearest sample;
determining a first evaluation value of the malicious sample category label based on the respective similarity of the at least one item; and
determining an evaluation value of the malicious sample category label based on the first evaluation value.
3. The method of claim 2, wherein the determining a first evaluation value for the malicious sample category tag based on a respective similarity of the at least one item comprises:
and respectively carrying out weighted summation on the similarity of the at least one item by utilizing the preset weight of the at least one item so as to calculate a first evaluation value of the malicious sample class label.
4. The method of claim 2, wherein the at least one item comprises at least one of: program database files, domain names, URL addresses, and IP addresses.
5. The method of claim 2, wherein,
the category prediction result further comprises: a confidence for the malicious sample class label;
the neighboring sample predictors further comprising: a distance between the nearest neighboring sample and the unknown sample;
the determining, based on the first evaluation value, an evaluation value of the malicious sample category tag includes:
taking the difference between the confidence coefficient and the distance as a second evaluation value of the malicious sample class label; and
and taking the sum of the first evaluation value and the second evaluation value as the evaluation value of the malicious sample class label.
6. The method of claim 1, further comprising:
when the evaluation value of the malicious sample class label is smaller than a third threshold value, determining that the malicious sample class label is wrong; and
and when the evaluation value of the malicious sample class label is larger than or equal to a third threshold value, determining that the malicious sample class label is correct.
7. The method of claim 1, wherein the processing the feature vectors using a pre-constructed classification model comprises:
processing the feature vectors based on a random forest algorithm to determine the category prediction result; and
processing the feature vector based on a k-neighborhood algorithm to determine the neighborhood sample predictor.
8. A sample sorting apparatus comprising:
the acquisition module is used for acquiring an unknown sample;
a feature determination module for determining a feature vector of the unknown sample;
the classification module is used for processing the feature vector by utilizing a pre-constructed classification model to obtain a class prediction result and a neighboring sample prediction result aiming at the unknown sample, wherein the class prediction result comprises a malicious sample class label; and
and the evaluation module is used for determining the evaluation value of the malicious sample class label based on the unknown sample and the adjacent sample prediction result.
9. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing the computer program being adapted to implement the method of any of claims 1-7.
11. A computer program product comprising computer readable instructions, wherein the computer readable instructions, when executed, are for performing the method of any of claims 1-7.
CN202011028622.7A 2020-09-25 2020-09-25 Sample classification method, apparatus, computer device, medium, and program product Pending CN112116018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011028622.7A CN112116018A (en) 2020-09-25 2020-09-25 Sample classification method, apparatus, computer device, medium, and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011028622.7A CN112116018A (en) 2020-09-25 2020-09-25 Sample classification method, apparatus, computer device, medium, and program product

Publications (1)

Publication Number Publication Date
CN112116018A true CN112116018A (en) 2020-12-22

Family

ID=73797972

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011028622.7A Pending CN112116018A (en) 2020-09-25 2020-09-25 Sample classification method, apparatus, computer device, medium, and program product

Country Status (1)

Country Link
CN (1) CN112116018A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112214770A (en) * 2020-10-30 2021-01-12 奇安信科技集团股份有限公司 Malicious sample identification method and device, computing equipment and medium
CN113242236A (en) * 2021-05-08 2021-08-10 国家计算机网络与信息安全管理中心 Method for constructing network entity threat map

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529293A (en) * 2016-11-09 2017-03-22 东巽科技(北京)有限公司 Sample classification determination method for malware detection
US20180114018A1 (en) * 2016-10-26 2018-04-26 Fortinet, Inc. Malware detection and classification based on memory semantic analysis
CN110363243A (en) * 2019-07-12 2019-10-22 腾讯科技(深圳)有限公司 The appraisal procedure and device of disaggregated model
CN110991538A (en) * 2019-12-05 2020-04-10 腾讯科技(深圳)有限公司 Sample classification method and device, storage medium and computer equipment
CN111062036A (en) * 2019-11-29 2020-04-24 暨南大学 Malicious software identification model construction method, malicious software identification medium and malicious software identification equipment
CN111091163A (en) * 2020-03-24 2020-05-01 杭州汇萃智能科技有限公司 Minimum distance classification method and device, computer equipment and storage medium
CN111353516A (en) * 2018-12-21 2020-06-30 华为技术有限公司 Sample classification method and model updating method for online learning
CN111368304A (en) * 2020-03-31 2020-07-03 绿盟科技集团股份有限公司 Malicious sample category detection method, device and equipment

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180114018A1 (en) * 2016-10-26 2018-04-26 Fortinet, Inc. Malware detection and classification based on memory semantic analysis
CN106529293A (en) * 2016-11-09 2017-03-22 东巽科技(北京)有限公司 Sample classification determination method for malware detection
CN111353516A (en) * 2018-12-21 2020-06-30 华为技术有限公司 Sample classification method and model updating method for online learning
CN110363243A (en) * 2019-07-12 2019-10-22 腾讯科技(深圳)有限公司 The appraisal procedure and device of disaggregated model
CN111062036A (en) * 2019-11-29 2020-04-24 暨南大学 Malicious software identification model construction method, malicious software identification medium and malicious software identification equipment
CN110991538A (en) * 2019-12-05 2020-04-10 腾讯科技(深圳)有限公司 Sample classification method and device, storage medium and computer equipment
CN111091163A (en) * 2020-03-24 2020-05-01 杭州汇萃智能科技有限公司 Minimum distance classification method and device, computer equipment and storage medium
CN111368304A (en) * 2020-03-31 2020-07-03 绿盟科技集团股份有限公司 Malicious sample category detection method, device and equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAO, Yunchang et al.: "A behavior-based clustering method for Android malware families", Journal of Wuhan University (Natural Science Edition) 《武汉大学学报(理学版)》, vol. 62, no. 5, 17 August 2016 (2016-08-17) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112214770A (en) * 2020-10-30 2021-01-12 奇安信科技集团股份有限公司 Malicious sample identification method and device, computing equipment and medium
CN112214770B (en) * 2020-10-30 2023-11-10 奇安信科技集团股份有限公司 Malicious sample identification method, device, computing equipment and medium
CN113242236A (en) * 2021-05-08 2021-08-10 国家计算机网络与信息安全管理中心 Method for constructing network entity threat map
CN113242236B (en) * 2021-05-08 2022-09-16 国家计算机网络与信息安全管理中心 Method for constructing network entity threat map

Similar Documents

Publication Publication Date Title
Kolosnjaji et al. Empowering convolutional networks for malware classification and analysis
CN106778241B (en) Malicious file identification method and device
Kolosnjaji et al. Deep learning for classification of malware system call sequences
US10200391B2 (en) Detection of malware in derived pattern space
WO2019128529A1 (en) Url attack detection method and apparatus, and electronic device
US11025649B1 (en) Systems and methods for malware classification
US10505960B2 (en) Malware detection by exploiting malware re-composition variations using feature evolutions and confusions
CN111460446B (en) Malicious file detection method and device based on model
US8931092B2 (en) System and method for computer inspection of information objects for shared malware components
US11580222B2 (en) Automated malware analysis that automatically clusters sandbox reports of similar malware samples
WO2019222462A1 (en) Identification of sensitive data using machine learning
US20160219068A1 (en) Method and apparatus for automatically identifying signature of malicious traffic using latent dirichlet allocation
US11159547B2 (en) Malware clustering approaches based on cognitive computing techniques
CN111338692B (en) Vulnerability classification method and device based on vulnerability codes and electronic equipment
CN111368289B (en) Malicious software detection method and device
Kumar et al. Malware classification using XGboost-Gradient boosted decision tree
CN112116018A (en) Sample classification method, apparatus, computer device, medium, and program product
US20170091451A1 (en) Creating rules describing malicious files based on file properties
Demertzis et al. A machine hearing framework for real-time streaming analytics using Lambda architecture
Ficco Comparing API call sequence algorithms for malware detection
More et al. Trust-based voting method for efficient malware detection
CN113986674A (en) Method and device for detecting abnormity of time sequence data and electronic equipment
AU2021251463B2 (en) Generating performance predictions with uncertainty intervals
Szarmach et al. Decision Tree-Based Algorithms for Detection of Damage in AIS Data
JP2011034377A (en) Information processor, information processing method and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: Room 332, 3 / F, Building 102, 28 xinjiekouwei street, Xicheng District, Beijing 100088

Applicant after: Qianxin Technology Group Co.,Ltd.

Applicant after: Qianxin Wangshen information technology (Beijing) Co.,Ltd.

Address before: Room 332, 3 / F, Building 102, 28 xinjiekouwei street, Xicheng District, Beijing 100088

Applicant before: Qianxin Technology Group Co.,Ltd.

Country or region before: China

Applicant before: LEGENDSEC INFORMATION TECHNOLOGY (BEIJING) Inc.