CN114676430A - Malicious software identification method, device, equipment and computer readable storage medium - Google Patents

Malicious software identification method, device, equipment and computer readable storage medium

Info

Publication number
CN114676430A
Authority
CN
China
Prior art keywords
feature
classifier
software
sample set
items
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210277743.8A
Other languages
Chinese (zh)
Inventor
尹嘉峻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Puhui Enterprise Management Co Ltd
Original Assignee
Ping An Puhui Enterprise Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Puhui Enterprise Management Co Ltd
Priority to CN202210277743.8A
Publication of CN114676430A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/561 Virus type analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to intelligent decision technology and discloses a malicious software identification method, which comprises: performing feature item extraction, associated feature item analysis and effective feature screening on a sample set containing normal software and malicious software to obtain an effective feature set; constructing classifiers based on different classification algorithms according to the effective feature set; training each classifier for malicious software identification by using the sample set; selecting, after training, the classifier with the highest harmonic value of the precision rate and the recall rate as a target classifier; and identifying malicious software in software to be detected by using the target classifier. The invention also provides a malicious software identification device, equipment and a computer readable storage medium. The invention can solve the problem that malicious software is difficult to identify after being disguised or deformed, and improves the efficiency and accuracy of malicious software identification.

Description

Malicious software identification method, device, equipment and computer readable storage medium
Technical Field
The invention relates to the technical field of intelligent decision making, in particular to a malicious software identification method, a malicious software identification device, malicious software identification equipment and a computer readable storage medium.
Background
Malware refers to a program that spreads through a network to invade a target host and that steals data or disrupts the normal operation of the target host equipment and network system.
The traditional malicious software defense measure is the overall signature method: the code of the software to be detected is analyzed, a hash value of the software is calculated by using a hash function, and the calculated hash value is used as the signature of the software. On the principle that two pieces of software with completely identical signatures are the same software, the signature of the software to be detected is matched against the malicious software signatures in a preset database to judge whether the software to be detected is known malicious software.
At present, more and more malicious software continuously disguises or deforms itself by changing its code, so that a brand-new signature is generated when signature detection is carried out on it, and defense measures based on overall signatures, such as antivirus software, email filtering and sandboxes, can be evaded.
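The weakness of the overall signature method can be illustrated with a minimal sketch. It assumes SHA-256 as the hash function and a small in-memory set as the preset signature database; both choices, and every name below, are illustrative and are not specified in the text.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known malicious files.
KNOWN_MALWARE_SIGNATURES = {
    hashlib.sha256(b"malicious payload v1").hexdigest(),
}


def signature(code: bytes) -> str:
    """Overall signature: the hash of the complete program bytes."""
    return hashlib.sha256(code).hexdigest()


def is_known_malware(code: bytes) -> bool:
    """Exact-match lookup of the signature in the database."""
    return signature(code) in KNOWN_MALWARE_SIGNATURES


original = b"malicious payload v1"
disguised = original + b"\x90"  # same payload with one padding byte appended

print(is_known_malware(original))   # True
print(is_known_malware(disguised))  # False: the new hash matches nothing
```

Because any change to the bytes produces an unrelated hash, exact signature matching only recognizes files it has already seen.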
Disclosure of Invention
The invention provides a method, a device and equipment for identifying malicious software and a computer readable storage medium, and mainly aims to solve the problem that the malicious software is difficult to identify after being disguised or deformed and improve the efficiency and the accuracy of identifying the malicious software.
In order to achieve the above object, the present invention provides a malware identification method, including:
acquiring a sample set containing normal software and malicious software, and extracting feature items of software operation features of the sample set;
performing correlation analysis on the feature items to obtain correlation feature items;
performing effective feature screening on the feature items of the sample set and the associated feature items to obtain an effective feature set;
constructing classifiers based on different classification algorithms according to the effective feature sets;
respectively carrying out training of malicious software identification on each classifier by using the sample set until the training meets a preset condition, and quitting the training to obtain an identification result of each classifier;
calculating to obtain a harmonic value between the precision rate and the recall rate of each classifier by using the identification result and the real result of the sample set, and selecting the classifier with the highest harmonic value as a target classifier;
And identifying the malicious software of the software to be detected by utilizing the target classifier.
Optionally, the obtaining a sample set including normal software and malicious software includes:
acquiring a software set containing normal software and malicious software for training;
extracting program files of each software in the software set in sequence;
assembling the extracted program files into the sample set.
Optionally, the extracting feature items of the software operating features of the sample set includes:
acquiring the file type of a program file corresponding to each sample in the sample set;
inquiring a feature label corresponding to each file type in a preset feature label set to obtain a feature label of a sample corresponding to the file type;
labeling the corresponding samples by using the feature labels of each sample in sequence to obtain labeled items;
and collecting all the labeled entries to obtain the characteristic items of the sample set.
Optionally, the performing a correlation analysis on the feature item to obtain a correlation feature item includes:
successively calculating the support degree of each feature item;
selecting feature items corresponding to the support degree which is greater than or equal to the preset minimum support degree to form a frequent item set;
And analyzing the frequent item set by using a preset association analysis algorithm to obtain association characteristic items.
Optionally, the performing effective feature screening on the feature items of the sample set and the associated feature items to obtain an effective feature set includes:
performing vector mapping on each feature item and each associated feature item of the sample set to obtain a feature vector set;
randomly selecting a preset number of feature vectors from the feature vector set as clustering centers;
sequentially calculating the distance from each feature vector in the feature vector set to the clustering center, and dividing each feature vector into categories corresponding to the clustering center with the minimum distance to obtain a plurality of category clusters;
and recalculating the clustering center of each category cluster, returning to the step of sequentially calculating the distance from each feature vector in the feature vector set to the clustering center until the clustering centers of the plurality of category clusters converge, and taking the feature items and the associated feature items corresponding to the converged category clusters as the effective feature set of the sample set.
Optionally, the training of identifying the malware is performed on each classifier by using the sample set respectively, and the training is exited until the training meets a preset condition, so as to obtain an identification result of each classifier, including:
Performing classification feature extraction on the sample set by using each classifier to obtain a classification feature set of the sample set;
performing probability calculation of malicious software on the classification feature set by using a preset prediction function to obtain an identification result of each sample in the sample set;
judging whether an error value between the recognition result and a real result of the sample set meets a preset condition or not by using a preset loss function;
if the error value does not meet the preset condition, adjusting the parameter value of each classifier, and returning to the step of performing classification feature extraction on the sample set by using each classifier;
and if the error value meets the preset condition, exiting the training of the classifier.
Optionally, the calculating, by using the recognition result and the real result of the sample set, a harmonic value between the precision rate and the recall rate of each classifier includes:
calculating a harmonic value for each of the classifiers using the following harmonic value formula:
[Formula image BDA0003556401360000031: harmonic value formula, not reproduced in this text]
wherein K represents the harmonic value, p_j represents the precision rate, and p_z represents the recall rate.
In order to solve the above problem, the present invention also provides a malware identification apparatus, including:
the sample feature extraction module is used for acquiring a sample set containing normal software and malicious software, extracting feature items of software operation features of the sample set, and performing correlation analysis on the feature items to obtain associated feature items;
the effective characteristic screening module is used for carrying out effective characteristic screening on the characteristic items of the sample set and the associated characteristic items to obtain an effective characteristic set;
the multi-classifier training module is used for constructing classifiers based on different classification algorithms according to the effective feature set, respectively training the malicious software identification of each classifier by using the sample set, and quitting the training until the training meets a preset condition to obtain the identification result of each classifier;
the target classifier selection module is used for calculating a harmonic value between the precision rate and the recall rate of each classifier by using the identification result and the real result of the sample set, and selecting the classifier with the highest harmonic value as a target classifier;
and the target classifier application module is used for identifying the malicious software of the software to be detected by utilizing the target classifier.
In order to solve the above problem, the present invention also provides an electronic device, including:
A memory storing at least one computer program; and
and the processor executes the program stored in the memory to realize the malicious software identification method.
In order to solve the above problem, the present invention also provides a computer-readable storage medium, in which at least one computer program is stored, and the at least one computer program is executed by a processor in an electronic device to implement the malware identification method described above.
According to the invention, the effective feature set is obtained by carrying out feature extraction, associated feature analysis and feature screening on the sample set, different classifiers are further constructed through the effective feature set, and the classifier with the highest harmonic value is selected for identifying the software to be detected through carrying out malicious software identification on the sample set by the different classifiers.
Drawings
Fig. 1 is a flowchart illustrating a malware identification method according to an embodiment of the present invention;
fig. 2 is a schematic detailed implementation flowchart of one step in the malware identification method according to an embodiment of the present invention;
FIG. 3 is a functional block diagram of a malware identification device according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device implementing the malware identification method according to an embodiment of the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a malicious software identification method. The execution subject of the malware identification method includes, but is not limited to, at least one of the electronic devices, such as a server and a terminal, that can be configured to execute the method provided by the embodiment of the present application. In other words, the malware identification method may be performed by software or hardware installed in the terminal device or the server device, and the software may be a blockchain platform. The server may be an independent server, or a cloud server providing basic cloud computing services such as cloud service, cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, Content Delivery Network (CDN), and big data and artificial intelligence platforms.
Referring to fig. 1, a flowchart of a malware identification method according to an embodiment of the present invention is shown. In the present embodiment, the malware identification method includes the following steps S1-S7:
S1, obtaining a sample set containing normal software and malicious software, and extracting feature items of software operation features of the sample set;
In the embodiment of the invention, the sample set comprises samples of two types, namely normal software and malicious software. The normal software can be collected from the internet or from personal hosts, and is software that has been detected as safe by anti-virus software such as 360 and Kaspersky. The malicious software includes, but is not limited to, Backdoor, Spyware, Trojan, Virus, Worm and other types of software, and can be downloaded from the VXHeavens website or collected in daily experiments.
In detail, the obtaining a sample set containing normal software and malware includes: acquiring a software set containing normal software and malicious software for training; extracting the program file of each software in the software set in sequence; and assembling the extracted program files into the sample set.
It can be understood that malware generally exhibits certain intrusion behaviors while running. For example, the malware may request the maximum control authority, such as the authority to access the root directory; it may perform behaviors that violate security policies, such as cracking passwords or capturing critical data; or it may perform attack behaviors, such as instantaneously generating a large number of messages to cause network congestion.
In the embodiment of the invention, behaviors other than the intrusion behaviors of the malicious software are regarded as normal behaviors. The intrusion behaviors and the normal behaviors constitute the software operation features of the sample set.
In detail, referring to fig. 2, the extracting the feature items of the software operation features of the sample set includes the following steps S11-S14:
S11, acquiring the file type of the program file corresponding to each sample in the sample set;
S12, inquiring a feature label corresponding to each file type in a preset feature label set to obtain a feature label of a sample corresponding to the file type;
S13, labeling the corresponding samples by sequentially utilizing the feature labels of each sample to obtain labeled items;
S14, collecting all the labeled items to obtain the feature items of the sample set.
In the embodiment of the present invention, because the samples in the sample set are developed in different languages and run on different platforms, the program file corresponding to each sample differs. The program file refers to the final executable program of each sample. For example, the program file of Android software is an APK file, the program file of software running on the Symbian operating system is a SIS file, and the program file of software running on the J2ME platform is a JAR file.
The preset feature tag set refers to intrusion behavior feature items and normal behavior feature items of each program file which are arranged in advance according to the structural features of the program file corresponding to each software, and the intrusion behavior feature items and the normal behavior feature items form the feature tag set.
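Steps S11 to S14 can be sketched as a lookup from file type to preset feature tags. This is a minimal illustration only: the tag names, the `FEATURE_TAGS` table and the use of the file extension as the file type are all assumptions, not details given in the text.

```python
from pathlib import Path

# Hypothetical preset feature tag set, keyed by program file type.
FEATURE_TAGS = {
    "apk": ["requests_root_access", "reads_contacts", "sends_sms"],
    "sis": ["installs_silently", "opens_network_socket"],
    "jar": ["loads_remote_class", "reads_file_system"],
}


def file_type(program_file: str) -> str:
    """S11: derive the file type from the extension (an assumed convention)."""
    return Path(program_file).suffix.lstrip(".").lower()


def extract_feature_items(sample_files):
    """S12-S14: look up the tags for each sample and collect the labeled items."""
    labeled_items = []
    for name in sample_files:
        tags = FEATURE_TAGS.get(file_type(name), [])  # S12: query the tag set
        labeled_items.append((name, tags))            # S13: label the sample
    return labeled_items                              # S14: collected items


items = extract_feature_items(["game.apk", "tool.JAR", "readme.txt"])
for name, tags in items:
    print(name, tags)
```

A file type that is missing from the table simply yields an empty tag list in this sketch; the text does not say how unknown types are handled.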
In the embodiment of the invention, by taking the software of the program file of the PE type as an example, in the section header part of the PE file, the frequency of the corresponding software for referencing DLL (dynamic link library) and API (application program interface) can be counted, the information gain of each DLL and API is calculated, and the intrusion behavior characteristic item and the normal behavior characteristic item are set according to the frequency and the information gain condition obtained by counting.
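The information gain of a single DLL or API reference can be computed as the reduction in label entropy obtained by splitting the samples on whether they reference it. The sketch below assumes binary labels (1 for malware, 0 for normal software) and a boolean flag per sample; the function names and data are hypothetical.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / total
        result -= p * math.log2(p)
    return result


def information_gain(references_api, labels):
    """Entropy of the labels minus the entropy left after splitting on the flag."""
    with_api = [l for flag, l in zip(references_api, labels) if flag]
    without_api = [l for flag, l in zip(references_api, labels) if not flag]
    total = len(labels)
    remaining = (len(with_api) / total) * entropy(with_api) \
        + (len(without_api) / total) * entropy(without_api)
    return entropy(labels) - remaining


labels = [1, 1, 0, 0]  # 1 = malware, 0 = normal software
perfect_split = information_gain([True, True, False, False], labels)
useless_split = information_gain([True, False, True, False], labels)
print(perfect_split, useless_split)
```

An API referenced only by the malicious samples gives the maximum gain here, while one referenced equally by both classes gives none, which is why gain is a reasonable basis for choosing feature items.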
S2, performing correlation analysis on the feature items to obtain correlation feature items;
In the embodiment of the present invention, the association analysis aims to mine, from all the feature items, the feature items that have a logical binding relationship, and to use the feature items having the logical binding relationship as one overall feature item for subsequent feature analysis and feature application.
In detail, the performing the association analysis on the feature item to obtain an associated feature item includes: successively calculating the support degree of each feature item; selecting feature items corresponding to the support degree which is greater than or equal to the preset minimum support degree to form a frequent item set; and analyzing the frequent item set by using a preset association analysis algorithm to obtain association characteristic items.
In the embodiment of the present invention, the support degree refers to the number of times that a feature item appears in the feature item set composed of all feature items. The minimum support degree refers to the minimum number of times that a feature item must appear in that feature item set to be retained.
In the embodiment of the present invention, the preset association analysis algorithm may adopt a Weka algorithm or an Apriori algorithm.
In the embodiment of the invention, through performing the association analysis on the feature items, on one hand, the associated feature items are taken as an integral feature item, which is beneficial to improving the value of the feature items, and on the other hand, the number of the feature items can be reduced.
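The support filtering and the first pairing step of an Apriori-style analysis can be sketched as follows. The sketch assumes that support is counted as the number of samples containing an item and that only pairs are mined; the sample data and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_items(samples, min_support):
    """Keep the feature items whose support reaches the minimum support."""
    counts = Counter(item for sample in samples for item in set(sample))
    return {item for item, n in counts.items() if n >= min_support}


def associated_pairs(samples, min_support):
    """Pairs of frequent items that also co-occur at least min_support times."""
    frequent = frequent_items(samples, min_support)
    pair_counts = Counter()
    for sample in samples:
        kept = sorted(set(sample) & frequent)
        pair_counts.update(combinations(kept, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


samples = [
    ["CreateRemoteThread", "WriteProcessMemory", "ReadFile"],
    ["CreateRemoteThread", "WriteProcessMemory"],
    ["ReadFile"],
    ["CreateRemoteThread", "WriteProcessMemory", "RegSetValue"],
]
pairs = associated_pairs(samples, min_support=3)
print(pairs)  # {('CreateRemoteThread', 'WriteProcessMemory'): 3}
```

The surviving pair could then be treated as one overall feature item, which replaces two separate items with a single, more informative one.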
S3, performing effective feature screening on the feature items of the sample set and the associated feature items to obtain an effective feature set;
It can be understood that the feature item set composed of the associated feature items and the other feature items may contain redundant feature items with similar features, or feature items that discriminate poorly between normal software and malware. Such redundant or low-value feature items can be removed through effective feature screening.
In detail, the performing effective feature screening on the feature items of the sample set and the associated feature items to obtain an effective feature set includes: performing vector mapping on each feature item and each associated feature item of the sample set to obtain a feature vector set; randomly selecting a preset number of feature vectors from the feature vector set as clustering centers; sequentially calculating the distance from each feature vector in the feature vector set to the clustering center, and dividing each feature vector into categories corresponding to the clustering center with the minimum distance to obtain a plurality of category clusters; and recalculating the clustering center of each category cluster, returning to the step of sequentially calculating the distance from each feature vector in the feature vector set to the clustering center until the clustering centers of the plurality of category clusters converge, and taking the feature items and the associated feature items corresponding to the converged category clusters as the effective feature set of the sample set.
The embodiment of the invention uses a clustering algorithm to screen the effective features. In another embodiment of the invention, an Embedded algorithm, a Filter algorithm or a Wrapper algorithm can be adopted to screen the effective features.
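The clustering loop described above follows the usual k-means pattern. The sketch below assumes Euclidean distance, feature vectors given as tuples of numbers, and exact equality of centers as the convergence test; the text does not state how feature items are mapped to vectors or how effective items are then read off the clusters, so those steps are left out.

```python
import math
import random


def kmeans(vectors, k, seed=0, max_iter=100):
    """Cluster feature vectors; returns (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)  # random initial clustering centers
    clusters = []
    for _ in range(max_iter):
        # Assign each vector to the center with the minimum distance.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: math.dist(v, centers[c]))
            clusters[nearest].append(v)
        # Recalculate each center as the mean of its cluster.
        new_centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # centers stopped moving: converged
            break
        centers = new_centers
    return centers, clusters


vectors = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = kmeans(vectors, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Feature vectors that land in the same cluster are close to one another, so a cluster is a natural unit for spotting redundant feature items.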
S4, constructing classifiers based on different classification algorithms according to the effective feature sets;
In the embodiment of the invention, it is necessary to determine whether a sample is normal software or malicious software, which is a classification prediction task, so classifiers based on the KNN algorithm, the decision tree algorithm and the random forest algorithm can be constructed.
The KNN (K-Nearest Neighbor) classifier takes all samples of known classes as a reference, calculates the distances between an unknown sample and all known samples, selects the K known samples closest to the unknown sample, and assigns the unknown sample to the class that occurs most often among those K nearest samples, following the majority voting rule.
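The KNN rule can be sketched in a few lines. The sketch assumes Euclidean distance and two-dimensional feature vectors; the sample data and the function name `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def knn_predict(known, unknown, k=3):
    """Classify `unknown` by majority vote among its k nearest known samples.

    `known` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(known, key=lambda pair: math.dist(pair[0], unknown))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


known = [
    ((0.1, 0.2), "normal"),
    ((0.2, 0.1), "normal"),
    ((0.9, 0.8), "malware"),
    ((0.8, 0.9), "malware"),
    ((0.7, 0.7), "malware"),
]
print(knn_predict(known, (0.75, 0.8)))   # malware
print(knn_predict(known, (0.15, 0.15)))  # normal
```

In the second call the three nearest samples are two normal samples and one malicious sample, so the majority vote returns "normal".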
The decision tree is a classification algorithm which expresses the interrelation among each feature in the effective feature set based on a tree structure and classifies the sample set by utilizing the tree structure.
The random forest is a classification algorithm formed by a plurality of decision trees, and the random forest fuses classification results of all the decision trees to obtain a final classification result.
In detail, the constructing a classifier based on different classification algorithms according to the effective feature set includes: acquiring a preset data label of the effective characteristic set; searching a classification algorithm matched with the data label in a preset classification algorithm mapping table; and constructing a corresponding classifier by using each classification algorithm obtained by searching.
In the embodiment of the present invention, the preset data tag is used to represent the characteristics of the effective feature set. For example, if the features in the effective feature set are nonlinear and discrete, the data tag of the effective feature set can be set to 1; conversely, the data tag of a feature set whose features are non-discrete is set to 0.
The preset classification algorithm mapping table defines the classification algorithms corresponding to different data tags. For example, when the data tag is 1, the corresponding classification algorithms may be KNN, SVM (Support Vector Machine) and the like, and when the data tag is 0, the corresponding classification algorithms may be GBDT (Gradient Boosting Decision Tree), XGBoost (eXtreme Gradient Boosting) and the like.
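The mapping table lookup can be sketched as a plain dictionary. The tag values and algorithm names come from the example in the text; the table layout and the function name are assumptions, and the strings are placeholders rather than constructed classifiers.

```python
# Data tag -> names of candidate classification algorithms (from the example).
ALGORITHM_MAPPING_TABLE = {
    1: ["KNN", "SVM"],        # nonlinear, discrete features
    0: ["GBDT", "XGBoost"],   # non-discrete features
}


def select_algorithms(data_tag):
    """Return the algorithms matched to the data tag, or none if unmapped."""
    return ALGORITHM_MAPPING_TABLE.get(data_tag, [])


print(select_algorithms(1))  # ['KNN', 'SVM']
print(select_algorithms(0))  # ['GBDT', 'XGBoost']
```

One classifier would then be built for each returned name, so that several candidates can be trained and compared.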
S5, respectively carrying out training of malware identification on each classifier by using the sample set until the training meets a preset condition, and quitting the training to obtain an identification result of each classifier;
In the embodiment of the present invention, the preset condition may be that the error between the real result and the recognition result of the sample set reaches a preset error threshold, or that the number of training iterations reaches a preset training number threshold; in either case the training is exited.
In detail, the training of identifying the malware is respectively performed on each classifier by using the sample set until the training meets a preset condition, and the training is exited to obtain an identification result of each classifier, and the training includes: carrying out classification feature extraction on the sample set by utilizing each classifier to obtain a classification feature set of the sample set; performing probability calculation of malicious software on the classification feature set by using a preset prediction function to obtain an identification result of each sample in the sample set; judging whether an error value between the recognition result and a real result of the sample set meets a preset condition or not by using a preset loss function; if the error value does not meet the preset condition, adjusting the parameter value of each classifier, and returning to the step of performing classification feature extraction on the sample set by using each classifier; and if the error value meets the preset condition, exiting the training of the classifier.
In an embodiment of the present invention, the preset prediction function may adopt the SOFTMAX function, also called the normalized exponential function. The SOFTMAX function is used to convert the classification feature set into malicious software probabilities, and the result with the largest probability is taken as the identification result of the sample.
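A minimal softmax sketch is shown below. It assumes each sample yields one raw score per class (here ordered as normal, malware); the score values and class order are hypothetical.

```python
import math


def softmax(scores):
    """Normalized exponential: turns raw scores into probabilities."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["normal", "malware"]
probabilities = softmax([0.5, 2.0])
prediction = classes[probabilities.index(max(probabilities))]
print(probabilities, prediction)
```

The outputs sum to one, and the class with the largest probability is taken as the identification result.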
In the embodiment of the present invention, the preset loss function may adopt the following function:
[Formula image BDA0003556401360000091: loss function rmse, not reproduced in this text]
wherein rmse is the error value, num is the number of samples in the sample set, i denotes the i-th sample in the sample set, pre_i is the identification result of the i-th sample, and grt_i is the true result of the i-th sample.
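The loss formula itself is an image that is not reproduced in this text. Given the name rmse and the symbols num, pre_i and grt_i, a reasonable assumption is the root mean square error, rmse = sqrt((1/num) * sum over i of (pre_i - grt_i)^2). The sketch below implements that assumed form together with the two exit conditions described for training; the threshold values are hypothetical.

```python
import math


def rmse(pre, grt):
    """Root mean square error between identification and true results.

    Assumed form; the original formula image is not available in the text.
    """
    num = len(pre)
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pre, grt)) / num)


def should_exit_training(error, rounds, error_threshold=0.1, max_rounds=100):
    """Exit when the error is small enough or the round limit is reached."""
    return error <= error_threshold or rounds >= max_rounds


error = rmse([1, 0, 1, 1], [1, 0, 0, 1])  # one wrong result out of four
print(error)  # 0.5
print(should_exit_training(error, rounds=3))
```

With labels encoded as 0 and 1, this error is the square root of the misclassification rate, so it falls toward zero as the identification results match the true results.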
It should be noted that the training process described above may be adopted for various classifiers such as KNN, decision tree, random forest, and the like.
S6, calculating a harmonic value between the precision rate and the recall rate of each classifier by using the recognition result and the real result of the sample set, and selecting the classifier with the highest harmonic value as a target classifier;
According to the embodiment of the invention, the identification of malicious software must be both precise and comprehensive, so the precision rate and the recall rate of each classifier need to be calculated.
The precision rate is the ratio of the number of samples whose true result is consistent with the identification result to the total number of samples with that identification result. For example, if 1000 samples are identified as malware, of which 300 are truly malware and the other 700 are truly not malware, the precision rate is 300/1000.
The recall rate is the ratio of the number of correctly identified positive samples to the number of all positive samples. For example, if 2000 samples are truly malware and 1200 of them are correctly identified, the recall rate is 1200/2000.
In one embodiment of the present invention, the following formula is used to calculate the harmonic value between the precision rate and the recall rate:
[Formula image BDA0003556401360000092: harmonic value formula, not reproduced in this text]
wherein K represents the harmonic value, p_j represents the precision rate, and p_z represents the recall rate.
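The harmonic value formula is an image that is not reproduced in this text. The sketch below assumes the standard harmonic mean of precision and recall (the F1 score), K = 2 * p_j * p_z / (p_j + p_z), and uses it to pick the target classifier. The classifier names and their precision and recall figures are hypothetical; the first pair reuses the 300/1000 and 1200/2000 values from the two separate worked examples purely for illustration.

```python
def precision_rate(true_positive, false_positive):
    """Share of samples identified as malware that truly are malware."""
    identified = true_positive + false_positive
    return true_positive / identified if identified else 0.0


def recall_rate(true_positive, false_negative):
    """Share of truly malicious samples that were identified."""
    positives = true_positive + false_negative
    return true_positive / positives if positives else 0.0


def harmonic_value(p_j, p_z):
    """Assumed form: harmonic mean of precision p_j and recall p_z (F1)."""
    return 2 * p_j * p_z / (p_j + p_z) if (p_j + p_z) else 0.0


def select_target_classifier(results):
    """`results` maps classifier name -> (precision, recall)."""
    return max(results, key=lambda name: harmonic_value(*results[name]))


p_j = precision_rate(300, 700)   # 300 of 1000 identified samples are malware
p_z = recall_rate(1200, 800)     # 1200 of 2000 malicious samples are found
k = harmonic_value(p_j, p_z)     # approximately 0.4

results = {"KNN": (p_j, p_z), "decision_tree": (0.8, 0.7), "random_forest": (0.9, 0.5)}
target = select_target_classifier(results)
print(round(k, 3), target)
```

Under this assumed formula a classifier with very high precision but low recall (or the reverse) scores lower than a balanced one, which matches the stated goal of being both precise and comprehensive.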
And S7, identifying the malicious software of the software to be detected by utilizing the target classifier.
In the embodiment of the present invention, since the harmonic value takes both the precision rate and the recall rate into account, the classifier with the highest harmonic value is selected as the target classifier.
In detail, the identifying malware of the software to be detected by using the target classifier includes: performing classification feature extraction on the software to be detected by using the target classifier to obtain a classification feature set of the software to be detected; and performing probability calculation of malicious software on the classification feature set by using a preset prediction function to obtain an identification result of the software to be detected.
By using a machine learning method, the embodiment of the present invention can effectively handle disguised or deformed malware, and improves the efficiency and accuracy of malware identification.
Fig. 3 is a functional block diagram of a malware identification apparatus according to an embodiment of the present invention.
The malware identification apparatus 100 of the present invention may be installed in an electronic device. According to the implemented functions, the malware identification apparatus 100 may include a sample feature extraction module 101, an effective feature screening module 102, a multi-classifier training module 103, a target classifier selection module 104, and a target classifier application module 105. A module in the present invention, which may also be referred to as a unit, is a series of computer program segments that are stored in a memory of the electronic device, can be executed by a processor of the electronic device, and perform a fixed function.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the sample feature extraction module 101 is configured to obtain a sample set including normal software and malicious software, extract feature items of software operation features of the sample set, and perform correlation analysis on the feature items to obtain associated feature items;
the effective feature screening module 102 is configured to perform effective feature screening on the feature items of the sample set and the associated feature items to obtain an effective feature set;
the multi-classifier training module 103 is configured to construct classifiers based on different classification algorithms according to the effective feature set, train each classifier for malware identification by using the sample set, and exit the training when the training meets a preset condition to obtain an identification result of each classifier;
the target classifier selecting module 104 is configured to calculate a harmonic value between a precision rate and a recall rate of each classifier by using the recognition result and the real result of the sample set, and select a classifier with a highest harmonic value as a target classifier;
the target classifier application module 105 is configured to perform malware identification on the software to be detected by using the target classifier.
In detail, when the modules in the malware identification apparatus 100 according to the embodiment of the present invention are used, the same technical means as the malware identification method described in fig. 1 to fig. 2 are adopted, and the same technical effect can be produced, and details are not described here.
Fig. 4 is a schematic structural diagram of an electronic device implementing a malware identification method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a malware recognition program, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of a malware recognition program, but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same function or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the whole electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (e.g., malware recognition programs, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 4 only shows an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 4 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The malware recognition program stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, implement the following method:
acquiring a sample set containing normal software and malicious software, and extracting feature items of software operation features of the sample set;
performing correlation analysis on the feature items to obtain associated feature items;
performing effective feature screening on the feature items of the sample set and the associated feature items to obtain an effective feature set;
constructing classifiers based on different classification algorithms according to the effective feature set;
training each classifier for malware identification by using the sample set, and exiting the training when the training meets a preset condition, to obtain an identification result of each classifier;
calculating a harmonic value between the precision rate and the recall rate of each classifier by using the identification result and the real result of the sample set, and selecting the classifier with the highest harmonic value as a target classifier;
and performing malware identification on the software to be detected by using the target classifier.
Further, if the integrated modules/units of the electronic device 1 are implemented in the form of software functional units and sold or used as separate products, they may be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, and a Read-Only Memory (ROM).
The invention also provides a computer-readable storage medium, in which a computer program is stored, which computer program, when executed by a processor of an electronic device, may implement the method of:
acquiring a sample set containing normal software and malicious software, and extracting feature items of software operation features of the sample set;
performing correlation analysis on the feature items to obtain associated feature items;
performing effective feature screening on the feature items of the sample set and the associated feature items to obtain an effective feature set;
constructing classifiers based on different classification algorithms according to the effective feature set;
training each classifier for malware identification by using the sample set, and exiting the training when the training meets a preset condition, to obtain an identification result of each classifier;
calculating a harmonic value between the precision rate and the recall rate of each classifier by using the identification result and the real result of the sample set, and selecting the classifier with the highest harmonic value as a target classifier;
and performing malware identification on the software to be detected by using the target classifier.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a string of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, which is used for verifying the validity (anti-counterfeiting) of the information and generating a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the same, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A malware identification method, the method comprising:
acquiring a sample set containing normal software and malicious software, and extracting feature items of software operation features of the sample set;
performing correlation analysis on the feature items to obtain associated feature items;
performing effective feature screening on the feature items of the sample set and the associated feature items to obtain an effective feature set;
constructing classifiers based on different classification algorithms according to the effective feature set;
training each classifier for malware identification by using the sample set, and exiting the training when the training meets a preset condition, to obtain an identification result of each classifier;
calculating a harmonic value between the precision rate and the recall rate of each classifier by using the identification result and the real result of the sample set, and selecting the classifier with the highest harmonic value as a target classifier;
and performing malware identification on the software to be detected by using the target classifier.
2. The method of claim 1, wherein the obtaining a sample set containing normal software and malware comprises:
acquiring a software set containing normal software and malicious software for training;
extracting program files of each software in the software set in sequence;
collecting the extracted program files as the sample set.
3. A malware identification method as claimed in any one of claims 1 or 2, wherein said extracting feature items of software operating features of said sample set comprises:
acquiring the file type of a program file corresponding to each sample in the sample set;
inquiring a feature label corresponding to each file type in a preset feature label set to obtain a feature label of a sample corresponding to the file type;
labeling the corresponding samples by using the feature labels of each sample in sequence to obtain labeled items;
and collecting all the labeled entries to obtain the characteristic items of the sample set.
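The label lookup described in claim 3 can be pictured with the following Python sketch. The contents of the preset feature label set, the use of the file extension as the file type, and all names are assumptions made for illustration only.

```python
import os

# Hypothetical preset feature label set, keyed by file type (extension).
FEATURE_LABELS = {
    ".exe": ["imports", "sections"],
    ".apk": ["permissions"],
}


def extract_feature_items(sample_paths):
    """Label each sample with the feature labels of its file type."""
    items = []
    for path in sample_paths:
        file_type = os.path.splitext(path)[1].lower()
        for label in FEATURE_LABELS.get(file_type, []):
            items.append((path, label))  # one labeled entry per (sample, label)
    return items


print(extract_feature_items(["a.exe", "b.apk", "c.txt"]))
# [('a.exe', 'imports'), ('a.exe', 'sections'), ('b.apk', 'permissions')]
```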
4. The malware identification method according to claim 1, wherein the performing correlation analysis on the feature items to obtain associated feature items comprises:
successively calculating the support degree of each feature item;
selecting the feature items whose support degree is greater than or equal to a preset minimum support degree to form a frequent item set;
and analyzing the frequent item set by using a preset association analysis algorithm to obtain the associated feature items.
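The support-based steps of claim 4 can be sketched as follows. The association analysis algorithm is not named in the text; this example assumes a simple Apriori-style pass that keeps pairs of frequent items whose joint support also reaches the minimum. Feature names and the threshold are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of samples whose feature items contain the whole itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_items(transactions, min_support):
    items = sorted({item for t in transactions for item in t})
    return [i for i in items if support({i}, transactions) >= min_support]


def associated_pairs(transactions, min_support):
    """Assumed association step: frequent pairs built from frequent items."""
    frequent = frequent_items(transactions, min_support)
    return [pair for pair in combinations(frequent, 2)
            if support(pair, transactions) >= min_support]


# Each set holds the feature items observed for one sample (made-up data).
samples = [{"net", "reg"}, {"net", "reg", "file"}, {"net"}, {"file"}]
print(frequent_items(samples, 0.5))    # ['file', 'net', 'reg']
print(associated_pairs(samples, 0.5))  # [('net', 'reg')]
```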
5. The method as claimed in claim 1, wherein the performing of effective feature screening on the feature items of the sample set and the associated feature items to obtain an effective feature set comprises:
performing vector mapping on each feature item and each associated feature item of the sample set to obtain a feature vector set;
randomly selecting a preset number of feature vectors from the feature vector set as clustering centers;
sequentially calculating the distance from each feature vector in the feature vector set to the clustering center, and dividing each feature vector into categories corresponding to the clustering center with the minimum distance to obtain a plurality of category clusters;
and recalculating the clustering center of each category cluster, returning to the step of sequentially calculating the distance from each feature vector in the feature vector set to the clustering center until the clustering centers of the plurality of category clusters converge, and taking the feature items and the associated feature items corresponding to the converged category clusters as the effective feature set of the sample set.
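The iteration in claim 5 follows the familiar k-means pattern. The sketch below is a plain-Python illustration under stated assumptions: squared Euclidean distance, convergence defined as unchanged cluster centers, and a fixed random seed. How the converged clusters are mapped back to an effective feature set is not specified in the text and is not shown.

```python
import random


def kmeans(vectors, k, max_iter=100, seed=0):
    """Cluster feature vectors (tuples of floats) into k category clusters."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)  # randomly chosen initial cluster centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            # Assign each vector to the center with the minimum distance.
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centers]
            clusters[dists.index(min(dists))].append(v)
        # Recompute each center as the mean of its cluster.
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # centers no longer move: converged
            break
        centers = new_centers
    return centers, clusters


vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = kmeans(vectors, k=2)
print(sorted(centers))  # [(0.0, 0.5), (10.0, 10.5)]
```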
6. The method as claimed in claim 1, wherein the training for malware recognition is performed on each classifier by using the sample set, and the training is exited until the training satisfies a preset condition, so as to obtain a recognition result of each classifier, and the method includes:
performing classification feature extraction on the sample set by using each classifier to obtain a classification feature set of the sample set;
performing probability calculation of malicious software on the classification feature set by using a preset prediction function to obtain an identification result of each sample in the sample set;
judging whether an error value between the recognition result and a real result of the sample set meets a preset condition or not by using a preset loss function;
if the error value does not meet the preset condition, adjusting the parameter value of each classifier, and returning to the step of performing classification feature extraction on the sample set by using each classifier;
and if the error value meets the preset condition, exiting the training of the classifier.
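The training loop of claim 6 can be illustrated with a small logistic-regression sketch. The classifier type, the cross-entropy loss, the gradient-descent parameter update, and the exit condition (loss at or below a threshold) are all assumptions chosen for the example; the text leaves the loss function and preset condition open.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(samples, labels, lr=0.5, loss_threshold=0.1, max_epochs=10000):
    """Train until the loss meets the preset condition or epochs run out."""
    n = len(samples)
    weights, bias = [0.0] * len(samples[0]), 0.0
    loss = float("inf")
    for _ in range(max_epochs):
        # Prediction step: malware probability for every sample.
        probs = [sigmoid(sum(w * x for w, x in zip(weights, s)) + bias)
                 for s in samples]
        # Loss step: mean cross-entropy between predictions and real results.
        eps = 1e-12
        loss = -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                    for p, y in zip(probs, labels)) / n
        if loss <= loss_threshold:  # preset condition met: exit training
            break
        # Otherwise adjust the parameter values and repeat.
        for j in range(len(weights)):
            grad = sum((p - y) * s[j] for p, y, s in zip(probs, labels, samples)) / n
            weights[j] -= lr * grad
        bias -= lr * sum(p - y for p, y in zip(probs, labels)) / n
    return weights, bias, loss


# Made-up one-feature data: label 1 = malware, 0 = normal.
weights, bias, loss = train([(-2.0,), (-1.0,), (1.0,), (2.0,)], [0, 0, 1, 1])
```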
7. The method of claim 1, wherein the calculating a harmonic value between the precision rate and the recall rate of each of the classifiers by using the identification result and the real results of the sample set comprises:
calculating the harmonic value of each classifier by using the following harmonic value formula:
[Formula image FDA0003556401350000031]
wherein K represents the harmonic value, p_j represents the precision rate, and p_z represents the recall rate.
8. An apparatus for malware identification, the apparatus comprising:
a sample feature extraction module, configured to acquire a sample set containing normal software and malicious software, extract feature items of software operation features of the sample set, and perform correlation analysis on the feature items to obtain associated feature items;
an effective feature screening module, configured to perform effective feature screening on the feature items of the sample set and the associated feature items to obtain an effective feature set;
a multi-classifier training module, configured to construct classifiers based on different classification algorithms according to the effective feature set, train each classifier for malware identification by using the sample set, and exit the training when the training meets a preset condition to obtain an identification result of each classifier;
a target classifier selection module, configured to calculate a harmonic value between the precision rate and the recall rate of each classifier by using the identification result and the real result of the sample set, and select the classifier with the highest harmonic value as a target classifier; and
a target classifier application module, configured to perform malware identification on the software to be detected by using the target classifier.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the malware identification method of any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the malware identification method of any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210277743.8A CN114676430A (en) 2022-03-21 2022-03-21 Malicious software identification method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210277743.8A CN114676430A (en) 2022-03-21 2022-03-21 Malicious software identification method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114676430A true CN114676430A (en) 2022-06-28

Family

ID=82075109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210277743.8A Pending CN114676430A (en) 2022-03-21 2022-03-21 Malicious software identification method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114676430A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115758368A (en) * 2023-01-10 2023-03-07 北京亿赛通科技发展有限责任公司 Malicious cracked software prediction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination