CN109359439A - Software detecting method, device, equipment and storage medium - Google Patents
- Publication number: CN109359439A
- Application number: CN201811257390.5A
- Authority: CN (China)
- Prior art keywords: feature, software, type feature, sample, character string
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/10—Protecting distributed programs or content, e.g. vending or licensing of copyrighted material ; Digital rights management [DRM]
- G06F21/12—Protecting executable software
Abstract
The invention discloses a software detection method, apparatus, device and storage medium. The method comprises: extracting the numeric features and non-numeric features contained in each sample in a software sample library; processing the non-numeric features with N selected non-cryptographic hash algorithms and converting the results into numeric features, where N is an integer greater than 1; constructing a feature matrix from the numeric features contained in each sample and the numeric features obtained by conversion; training a machine learning classifier with the feature matrix; and detecting target software with the machine learning classifier. The invention converts the complex string features extracted from malware samples into hash features that machine learning algorithms can process easily, thereby reducing the difficulty of model training, markedly increasing training speed, reducing space overhead, and improving the accuracy of malware identification.
Description
Technical field
The present invention relates to the field of detection technology, and in particular to a software detection method, apparatus, device and storage medium.
Background
Malware mainly includes destructive computer viruses, worms, Trojan backdoors, exploit programs, advertising and phishing code, and the like. Such malware can combine multiple evasion techniques with security vulnerabilities to break through the monitoring of existing traditional defense systems and cause great damage to users' interests. The purpose of a malware detection system is to discover malware hidden among normal files in time, to take measures autonomously as far as possible before the malware causes damage, and to notify the user promptly.
Current malware detection methods fall into two kinds: static file analysis and dynamic behavior analysis. Existing static detection techniques rely mainly on matching against manually generated signature databases and rule bases; even the more advanced heuristic antivirus techniques need a manually maintained expert knowledge base to assist identification. However, with the explosive growth of the Internet, thousands upon thousands of hosts and users face threats from malware that is mutated, polymorphic, packed or obfuscated. How to respond rapidly to virus variants and malware attacks, process and analyze massive and highly varied malware automatically, raise the malware detection rate and lower the false alarm rate has become the main challenge for current malware detection.
Detection methods based on machine learning do not depend on signature databases or expert knowledge bases. They use a trained model to identify malware quickly and automatically, and a further trained model can classify the malware, so they have good research and application prospects. Machine learning malware detection relies on two main steps. The first is to select a sufficient number of suitable samples and extract their features: the extracted numeric and non-numeric features are screened and cleaned, missing and erroneous items are removed, the numeric features are standardized and normalized, and the non-numeric features are then encoded, usually with one-hot encoding, into a numeric form that a computer can process; all the extracted features are then combined into a feature matrix. The second is to select a suitable machine learning modeling approach. For the problems posed by today's massive volume of malware, traditional methods such as logistic regression, naive Bayes, support vector machines and decision trees are unsuitable for malware detection and identification because of slow training, heavy resource consumption and poor model evaluation results.
Traditional malware feature extraction methods handle the extracted string information either with one-hot encoding or by converting it to numeric ASCII codes. This approach has the following drawbacks:
1. One-hot encoding is effective when both the number of strings in the string set and the strings themselves are fixed. For string features extracted from malware, however, the total amount of malware is unbounded and new malware emerges endlessly, so estimating the string set of the whole population from the string set of the training samples introduces a large bias;
2. Converting strings to ASCII codes does turn string features into numeric features, but the string features extracted from different samples may differ in length, so the number of converted features also differs. Tokenizing and segmenting strings in ASCII-code form is difficult, and an algorithm must still be designed to bring the feature matrix fed to the machine learning model to a consistent dimension, so the complexity remains high;
3. It is difficult to cope with the many techniques used to evade antivirus engine detection, such as the mass obfuscation produced by virus generators, string mutation, and manually added interference or noise.
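To make the first two drawbacks concrete, the following minimal sketch (an illustration, not part of the patent) contrasts one-hot and ASCII encodings with a fixed-width hash. The sample strings and helper names are hypothetical, and `zlib.crc32` merely stands in for any non-cryptographic hash.

```python
import zlib

train_strings = ["kernel32.dll", "user32.dll"]  # hypothetical training vocabulary
unseen = "evil_payload.dll"                     # string never seen during training

def one_hot(s, vocab):
    """One-hot encode s against a fixed vocabulary; unseen strings map to all zeros."""
    return [1 if s == v else 0 for v in vocab]

def ascii_codes(s):
    """ASCII encoding: one number per character, so the length varies with the string."""
    return [ord(c) for c in s]

def hashed(s):
    """One fixed-width integer, regardless of string length or vocabulary."""
    return zlib.crc32(s.encode("utf-8"))

print(one_hot(unseen, train_strings))                       # the new string carries no information
print(len(ascii_codes("a.dll")), len(ascii_codes(unseen)))  # feature counts differ per string
print(hashed("a.dll"), hashed(unseen))                      # one 32-bit integer each
```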
It can be seen that the malware feature extraction methods used in existing machine-learning-based detection cannot meet these needs. How to convert the complex string features extracted from malware samples into features that machine learning algorithms can process easily, so as to reduce the difficulty of model training and increase training speed, is therefore the technical problem to be solved by the present invention.
Summary of the invention
In view of the above problems, embodiments of the present invention provide a software detection method, apparatus, device and storage medium.
According to one aspect of the embodiments of the present invention, a software detection method is provided, comprising:
extracting the numeric features and non-numeric features contained in each sample in a software sample library;
processing the non-numeric features with N selected non-cryptographic hash algorithms and converting the results into numeric features, where N is an integer greater than 1;
constructing a feature matrix from the numeric features contained in each sample and the numeric features obtained by conversion;
training a machine learning classifier with the feature matrix;
detecting target software with the machine learning classifier.
According to another aspect of the embodiments of the present invention, a software detection apparatus is provided, comprising:
a feature extraction module, configured to extract the numeric features and non-numeric features contained in each sample in a software sample library;
a feature processing module, configured to process the non-numeric features with N selected non-cryptographic hash algorithms and convert the results into numeric features, where N is an integer greater than 1;
a matrix construction module, configured to construct a feature matrix from the numeric features contained in each sample and the numeric features obtained by conversion;
a learning and training module, configured to train a machine learning classifier with the feature matrix;
a detection module, configured to detect target software with the machine learning classifier.
According to a third aspect of the embodiments of the present invention, a computing device is provided, comprising a memory, a processor and a communication bus, the communication bus being used for connection and communication between the processor and the memory;
the processor is configured to execute a software detection program stored in the memory so as to implement the following method steps:
extracting the numeric features and non-numeric features contained in each sample in a software sample library;
processing the non-numeric features with N selected non-cryptographic hash algorithms and converting the results into numeric features, where N is an integer greater than 1;
constructing a feature matrix from the numeric features contained in each sample and the numeric features obtained by conversion;
training a machine learning classifier with the feature matrix;
detecting target software with the machine learning classifier.
According to a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the program implements the following method steps:
extracting the numeric features and non-numeric features contained in each sample in a software sample library;
processing the non-numeric features with N selected non-cryptographic hash algorithms and converting the results into numeric features, where N is an integer greater than 1;
constructing a feature matrix from the numeric features contained in each sample and the numeric features obtained by conversion;
training a machine learning classifier with the feature matrix;
detecting target software with the machine learning classifier.
Compared with the prior art, the present invention has the following beneficial effects:
The software detection scheme proposed by the embodiments of the present invention uses a software detection method based on mixed non-cryptographic hash features and a machine learning model. It converts the complex string features extracted from malware samples into hash features that machine learning algorithms can process easily, thereby reducing the difficulty of model training, markedly increasing training speed, reducing space overhead, and improving the accuracy of malware identification.
The scheme gives good detection results in application scenarios that lack a rich malware expert knowledge base or a largely complete virus signature database. It also withstands the evasion techniques commonly used by malware authors, such as mutation and polymorphism, and is strongly resistant to manually added interference, packing and obfuscation. A machine learning classifier that uses this feature processing method has good interference immunity and robustness.
The above is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the embodiments are more apparent, specific embodiments of the present invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the embodiments of the present invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 is a flow chart of a software detection method provided by the first embodiment of the present invention;
Fig. 2 is a flow chart of a software detection method provided by the second embodiment of the present invention;
Fig. 3 is a flow chart of a feature processing method based on mixed non-cryptographic hash algorithms provided by the third embodiment of the present invention;
Fig. 4 is a flow chart of the mixed splicing and recombination method in the third embodiment of the present invention;
Fig. 5 is a structural block diagram of a software detection apparatus provided by the fourth embodiment of the present invention;
Fig. 6 is a structural block diagram of a software detection apparatus provided by the fifth embodiment of the present invention;
Fig. 7 is a structural block diagram of a computing device provided by the sixth embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope conveyed fully to those skilled in the art.
The first embodiment of the present invention provides a software detection method that addresses the shortcomings of existing malware detection methods by using mixed non-cryptographic hash features. Specifically, as shown in Fig. 1, the method of this embodiment includes the following steps:
Step S101: extract the numeric features and non-numeric features contained in each sample in a software sample library;
In the embodiment of the present invention, before this step is executed, software samples are obtained and the software sample library is built. Specifically, when a malware sample is obtained, it is marked as a black sample and the type of the malware is determined; when a normal software sample is obtained, it is marked as a white sample. The software in the sample library is then used for the subsequent feature extraction and machine learning.
In the embodiment of the present invention, by way of example, the sample programs in the software sample library are mainly PE (Portable Executable) files, or DLL (Dynamic Link Library) files with a similar file structure, so that numeric and non-numeric features can be extracted from the sample program files. The sample programs may of course be files of other types; the present invention is not limited to the PE or DLL file types.
In one specific embodiment of the present invention, the numeric features include one or more of the following: code header information, code section information, string statistics, overall sample statistics, the list of functions in the import address table, the list of exported functions, byte count information, and byte entropy statistics.
In the embodiment of the present invention, the non-numeric features are mainly string data. In one specific embodiment of the present invention, the non-numeric features include one or more of the following: the sequence of recognizable strings in the software header information, the sequence of all path strings, the sequence of all uniform resource locator strings, the sequence of all registry key strings, the machine model string in the software header information, the sequence of all section name strings of the software, the entry point name string, and the sequence of strings formed by Q or more consecutive recognizable characters in all sections of the software, where Q is a positive integer. In one exemplary embodiment Q is 5, but Q is not limited to this value.
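As a rough illustration of how such string features might be pulled out of a binary, the sketch below scans a byte buffer for runs of at least Q printable ASCII characters and sorts them into a few categories. The regular expressions, category names and sample bytes are assumptions made for the example; the patent does not specify them.

```python
import re

Q = 5  # minimum run length; 5 is the value used in the exemplary embodiment

def printable_strings(data: bytes, q: int = Q):
    """Return all runs of at least q consecutive printable ASCII characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % q)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

def categorize(strings):
    """Split extracted strings into rough categories (illustrative patterns only)."""
    return {
        "urls": [s for s in strings if re.search(r"https?://", s)],
        "paths": [s for s in strings if re.search(r"[A-Za-z]:\\", s)],
        "registry": [s for s in strings if "HKEY" in s],
    }

# Hypothetical bytes standing in for the contents of a sample file.
blob = (b"\x00\x01http://example.com/a\x00C:\\Windows\\x.dll"
        b"\xffHKEY_LOCAL_MACHINE\\Software\x00abc\x00")
found = printable_strings(blob)
print(found)
print(categorize(found))
```

The short run `abc` is dropped because it has fewer than Q characters.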
It should be noted that those skilled in the art may add to or remove from the above features as required, and all such variations fall within the scope of protection of the present invention.
Step S102: process the non-numeric features with N selected non-cryptographic hash algorithms and convert the results into numeric features, where N is an integer greater than 1;
In the embodiment of the present invention, the non-cryptographic hash algorithms are selected so that they complement one another, which avoids the hash collisions and feature loss that result from using a single algorithm alone.
In one exemplary embodiment of the present invention, three non-cryptographic hash algorithms are selected: MurmurHash3, SimHash and CRC32. Those skilled in the art may of course add to or remove from this set. Which algorithms are used is not the focus of the embodiments; what the present invention seeks to protect is the scheme of extracting mixed features with a mixture of non-cryptographic hash algorithms.
In one specific embodiment of the present invention, processing the non-numeric features with the N selected non-cryptographic hash algorithms and converting the results into numeric features specifically comprises:
(1) grouping the non-numeric features according to a set grouping scheme;
(2) for each group of non-numeric features, hashing the group separately with each of the N non-cryptographic hash algorithms to obtain N hash values, and converting the N hash values into integers;
(3) splicing the integer features of all groups to obtain the converted numeric features.
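Steps (1) to (3) can be sketched as follows with N = 3. This is a minimal illustration under stated assumptions, not the patent's implementation: each feature category is treated as one group, a group is joined with newlines before hashing, MurmurHash3 is a pure-Python version of the 32-bit x86 variant, and `simhash32` is a simplified per-bit majority vote over token hashes. All function names are hypothetical.

```python
import zlib

MASK32 = 0xFFFFFFFF

def _rotl32(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK32

def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python MurmurHash3 (x86, 32-bit variant)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = _rotl32((k * c1) & MASK32, 15)
        k = (k * c2) & MASK32
        h = _rotl32(h ^ k, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    tail = data[4 * n_blocks:]
    if tail:  # 1 to 3 leftover bytes
        k = int.from_bytes(tail, "little")
        k = _rotl32((k * c1) & MASK32, 15)
        h ^= (k * c2) & MASK32
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h

def simhash32(tokens) -> int:
    """Simplified 32-bit SimHash: per-bit majority vote over the token hashes."""
    votes = [0] * 32
    for tok in tokens:
        h = murmur3_32(tok.encode("utf-8"))
        for bit in range(32):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(32) if votes[bit] > 0)

def hash_group(strings):
    """Step (2): hash one group with N = 3 algorithms, giving 3 integers."""
    joined = "\n".join(strings).encode("utf-8")
    return [murmur3_32(joined), simhash32(strings), zlib.crc32(joined)]

def hash_features(groups):
    """Steps (1) and (3): visit groups in a fixed order and splice the results."""
    row = []
    for name in sorted(groups):
        row.extend(hash_group(groups[name]))
    return row

sample = {  # hypothetical grouped string features of one sample
    "urls": ["http://example.com/a"],
    "paths": ["C:\\Windows\\x.dll"],
    "sections": [".text", ".data"],
}
print(hash_features(sample))  # 3 groups x 3 hashes = 9 integers
```

Whatever the number or length of the input strings, each sample yields the same number of integer columns, which is what lets the rows be stacked into a matrix.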
Step S103: construct a feature matrix from the numeric features contained in each sample and the numeric features obtained by conversion;
In one specific embodiment of the present invention, this step is implemented as follows:
standardize each numeric feature;
normalize the standardized feature data;
construct the feature matrix from the normalized feature data.
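The three sub-steps can be sketched in plain Python as below. The text does not say which standardization or normalization is meant, so z-scoring followed by min-max scaling is an assumption, as are the function names and the toy data.

```python
from statistics import mean, pstdev

def standardize(col):
    """Z-score a column; a constant column becomes all zeros."""
    mu, sigma = mean(col), pstdev(col)
    return [0.0 if sigma == 0 else (x - mu) / sigma for x in col]

def normalize(col):
    """Min-max scale a column into [0, 1]; a constant column becomes all zeros."""
    lo, hi = min(col), max(col)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in col]

def build_feature_matrix(rows):
    """rows: one list of numeric features per sample, all of equal length."""
    columns = [normalize(standardize(list(c))) for c in zip(*rows)]
    return [list(r) for r in zip(*columns)]

# Hypothetical samples: a small count, a 32-bit hash value, a constant flag.
rows = [[10, 3000000000, 1], [20, 1000000000, 1], [30, 2000000000, 1]]
matrix = build_feature_matrix(rows)
print(matrix)
```

Scaling per column keeps large hash values from dominating small counts when the rows are fed to a classifier.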
In an optional embodiment of the present invention, after the feature matrix is constructed, the method further includes performing dimensionality reduction on the feature matrix according to a set dimensionality reduction method, so as to remove feature columns that show no clear correlation; the result is then fed to the machine learning classifier for training.
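The dimensionality reduction method is left open in the text. One simple possibility, shown here purely as an assumption, is to drop columns whose absolute Pearson correlation with the label falls below a threshold; the threshold value and the toy data are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / sqrt(vx * vy)

def select_columns(matrix, labels, threshold=0.1):
    """Indices of columns whose |correlation with the label| reaches the threshold."""
    cols = list(zip(*matrix))
    return [j for j, c in enumerate(cols) if abs(pearson(c, labels)) >= threshold]

def reduce_matrix(matrix, keep):
    """Keep only the selected columns of every row."""
    return [[row[j] for j in keep] for row in matrix]

matrix = [[0.0, 0.5, 0.9], [0.1, 0.5, 0.8], [0.9, 0.5, 0.1], [1.0, 0.5, 0.2]]
labels = [0, 0, 1, 1]  # 1 = malware, 0 = normal
keep = select_columns(matrix, labels)
print(keep, reduce_matrix(matrix, keep))  # the constant middle column is dropped
```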
Step S104: train a machine learning classifier with the feature matrix;
In one specific embodiment of the present invention, training the machine learning classifier with the feature matrix specifically includes:
training a first machine learning classifier with the feature matrix constructed from software samples labeled as malware or normal software, so as to classify software as malware or normal software;
training a second machine learning classifier with the feature matrix constructed from malware samples labeled with their different types, so as to classify the type of malware.
That is, the first machine learning classifier is a binary classification model, and the second machine learning classifier is a multi-class classification model.
In an optional embodiment of the present invention, after the machine learning classifier has been trained, the method further includes:
testing the trained machine learning classifier with a test sample set, so as to adjust the model parameters of the machine learning classifier.
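The text does not name the metrics used on the test set. Since the background section speaks of raising the detection rate and lowering the false alarm rate, the sketch below computes those two quantities; treating them as the evaluation criteria is an assumption, and the label vectors are made up.

```python
def detection_metrics(y_true, y_pred):
    """Return (detection rate, false alarm rate) for labels 1 = malware, 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malware caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malware missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal software flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal software passed
    detection_rate = tp / (tp + fn) if tp + fn else 0.0
    false_alarm_rate = fp / (fp + tn) if fp + tn else 0.0
    return detection_rate, false_alarm_rate

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(detection_metrics(y_true, y_pred))
```

Model parameters could then be adjusted until both figures on the held-out test set are acceptable.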
Step S105: detect target software with the machine learning classifier.
Specifically, in the embodiment of the present invention, features are extracted from the target software in the manner of step S102 and fed to the machine learning classifier for detection. This process is a real-time, online test, whereas steps S101 to S104 above can be carried out offline.
Specifically, in the embodiment of the present invention, the first machine learning classifier distinguishes malware from normal software and the second machine learning classifier determines the type of malware, thereby completing the detection of the target software.
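The two-stage use of the classifiers can be sketched as a small cascade. The classifiers here are toy stand-ins written as plain functions so that the example runs; in practice they would be the trained binary and multi-class models, and the `detect` helper and type names are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

Features = Sequence[float]

def detect(features: Features,
           is_malware: Callable[[Features], bool],
           malware_type: Callable[[Features], str]) -> Tuple[bool, Optional[str]]:
    """Run the binary classifier first; query the type classifier only for malware."""
    if not is_malware(features):
        return False, None
    return True, malware_type(features)

# Toy stand-ins for the first (binary) and second (multi-class) classifiers.
def toy_binary(f: Features) -> bool:
    return f[0] > 0.5

def toy_type(f: Features) -> str:
    return "trojan" if f[1] > 0.5 else "worm"

print(detect([0.9, 0.7], toy_binary, toy_type))
print(detect([0.1, 0.7], toy_binary, toy_type))
```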
In an optional embodiment of the present invention, when the target software is detected to be malware, an alert is additionally raised in a set alert mode.
In summary, the software detection scheme proposed by the embodiments of the present invention uses a software detection method based on mixed non-cryptographic hash features and a machine learning model. It converts the complex string features extracted from malware samples into hash features that machine learning algorithms can process easily, thereby reducing the difficulty of model training, markedly increasing training speed, reducing space overhead, and improving the accuracy of malware identification.
The second embodiment of the present invention provides a software detection method. Compared with the first embodiment, this embodiment explains the implementation in more detail using a concrete application example. It should be noted that the many technical details disclosed in this embodiment serve to explain the present invention and are not intended to limit it.
Specifically, as shown in Fig. 2, this embodiment provides a software detection method and, more specifically, a feature processing method based on mixed non-cryptographic hash algorithms together with a malware detection approach built on that method and a machine learning algorithm. It comprises the following steps:
Step S100: collect training samples and build the software sample library;
Specifically, in this embodiment, malware samples for machine learning training are obtained, labeled as black samples and recorded as the integer 1; a corresponding number of normal program samples is collected at the same time, labeled as white samples and recorded as the integer 0.
To ensure that the collected black and white samples are genuine and reliable, in one exemplary embodiment of the present invention the publicly available antivirus engines on the virustotal website (about 60 to 70 in total; the number of usable engines varies with the type of file scanned) are used to scan the collected software samples one by one. The criterion is that a sample flagged by 50 or more antivirus engines is classified as malware, and a sample flagged by no engine at all is classified as a normal file. In this way 500,000 malware samples and 500,000 normal software samples are collected, of which 400,000 malware and 400,000 normal samples form the training data set and 100,000 malware and 100,000 normal samples form the test data set. The collected program samples are mainly PE files, or DLL files with a similar file structure. The many antivirus engines on virustotal can also be used to classify the malware: by voting, the type identified by the largest number of antivirus products is taken as the type and family of the malware in the training data.
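The labeling rule and the vote on malware type can be sketched as follows. The input format (one verdict per engine, `None` when the engine did not flag the file) is hypothetical, and discarding samples flagged by between 1 and 49 engines is an assumption, since the text only defines the two clear-cut cases.

```python
from collections import Counter

MALWARE_MIN_DETECTIONS = 50  # "50 or more engines" in the exemplary embodiment

def label_sample(verdicts):
    """verdicts: one entry per engine, a type/family string if flagged, else None.

    Returns (label, family): 1 = black sample, 0 = white sample,
    None = left out of the library as ambiguous (an assumption of this sketch).
    """
    hits = [v for v in verdicts if v is not None]
    if len(hits) >= MALWARE_MIN_DETECTIONS:
        family = Counter(hits).most_common(1)[0][0]  # type named by the most engines
        return 1, family
    if not hits:
        return 0, None
    return None, None

print(label_sample(["trojan"] * 40 + ["worm"] * 15 + [None] * 10))
print(label_sample([None] * 65))
print(label_sample(["trojan"] * 3 + [None] * 60))
```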
Step S200: for the collected software training samples, extract the data in each sample;
Specifically, in the embodiment of the present invention, the extracted information is divided into numeric information (including Boolean values, which are treated as 0 and 1) and non-numeric information (mainly string data). All the data are checked, and any missing or misaligned data are corrected to ensure that the data obtained are complete and error-free.
In the embodiment of the present invention, the extracted numeric features specifically include: code header information, code section information, string statistics, overall sample statistics, the list of functions in the import address table, the list of exported functions, byte count information, and byte entropy statistics. Examples of the specific feature types are shown in Table 1:
Table 1: Extracted numeric features
In this embodiment:
Code header information includes: the virtual size of the file, whether the file is in debug mode, whether it contains a signature, the PE timestamp, other numeric information in the PE file header, and whether it contains a thread-local storage table;
Code section information includes: whether a resource section is present, the number of sections, the number of zero-size code sections, the number of unnamed code sections, and the number of sections marked "MEM_WRITE";
String statistics include: the number of recognizable strings, the average string length, the count of printable characters, and the sum of the information entropy of all characters;
Overall sample statistics include: the number of occurrences of the path identifier "C:", the total number of occurrences of "http(s)://", the number of occurrences of "HKEY", the number of occurrences of "MZ", whether a relocation table is present, and the number of symbols in the symbol table;
The list of functions in the import address table includes: the number of functions in the import address table;
The list of exported functions includes: the number of exported functions;
Byte count information includes: the number of occurrences of each byte value from 0x00 to 0xFF in the whole file, and the total number of bytes in the file;
Byte entropy statistics include: the information entropy distribution of the bytes.
In this embodiment, the extracted non-numeric features include: the sequence of recognizable strings in the software header information, the sequence of all path strings, the sequence of all uniform resource locator strings, the sequence of all registry key strings, the machine type string in the software header information, the sequence of all section name strings of the software, the entry point name string, and the sequence of strings formed by Q or more consecutive recognizable characters in all sections of the software, where Q is a positive integer. In an exemplary embodiment Q is 5, but Q is not limited to this value. Examples of the specific feature types are shown in Table 2:
Table 2. Extracted non-numeric features
For the features and feature extraction methods listed in Tables 1 and 2 above, the following convention applies: if a numeric feature is empty, it is replaced with the integer value 0; if a non-numeric feature is empty, it is replaced with the string "0".
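A few of the features above can be sketched in code. The following is a minimal illustration only, assuming the sample is available as raw bytes; the function names are hypothetical, and the definition of a "recognizable character" as printable ASCII (0x20 to 0x7E) is an assumption rather than something the text specifies.

```python
import math
import re
from collections import Counter

Q = 5  # minimum run length, as in the exemplary embodiment


def byte_histogram(data: bytes) -> list:
    """Count occurrences of each byte value 0x00..0xFF."""
    counts = Counter(data)
    return [counts.get(b, 0) for b in range(256)]


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in Counter(data).values())


def printable_strings(data: bytes, q: int = Q) -> list:
    """Runs of q or more printable ASCII characters (assumed definition)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % q)
    return [m.decode("ascii") for m in pattern.findall(data)]


def fill_empty(value, numeric: bool):
    """Empty-value convention: 0 for numeric features, "0" for non-numeric ones."""
    if value is None or value == "" or value == []:
        return 0 if numeric else "0"
    return value
```

A real extractor would also parse the PE header and section table; that part is omitted here because the text does not name a parsing library.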
Step S300: Apply mixed hash feature processing, based on the three non-cryptographic hash algorithms MurmurHash3, SimHash, and CRC32, to the above non-numeric features, converting these hard-to-handle string features into a fixed-length numeric feature matrix.
A hash algorithm maps an element into a specific range. Hash algorithms fall broadly into two classes: cryptographic and non-cryptographic. The common MD5 algorithm is a cryptographic hash algorithm that maps a string of arbitrary length to a 128-bit (16-byte) hash value; it is widely used and has an extremely low collision rate. For feature processing in machine learning models, however, cryptographic hash algorithms are not suitable. The reason is that feature processing needs to preserve as much as possible of what the original features have in common, so that later training can use those commonalities to distinguish classes. Cryptographic hash algorithms such as MD5 are very sensitive to changes in the original feature: flipping a single bit drastically changes the MD5 hash value and destroys the information contained in the original feature, which is very detrimental to machine learning training. This embodiment therefore processes these non-numeric features with non-cryptographic hash algorithms, which retain the class information of the original features to the greatest extent and serve as an effective feature processing method.
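The sensitivity of MD5 described above can be observed directly. The sketch below, which uses only the Python standard library and an arbitrarily chosen pair of example strings, compares two inputs that differ in exactly one bit and counts how many of the 128 output bits differ.

```python
import hashlib


def hamming_bits(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


s1 = b"kernel32.dll"
s2 = b"kernel32.dlm"  # 'l' (0x6c) vs 'm' (0x6d): exactly one bit apart

input_diff = hamming_bits(s1, s2)
digest_diff = hamming_bits(hashlib.md5(s1).digest(), hashlib.md5(s2).digest())
print(f"input bits changed: {input_diff}, MD5 bits changed: {digest_diff} of 128")
```

For a well-mixed hash, roughly half of the output bits are expected to change, so the two digests carry no usable signal that the inputs were nearly identical.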
Step S400: Train a machine learning classifier on the feature matrix obtained in step S300 to obtain a machine learning classification model.
Specifically, for training data labeled as malware or normal files, a binary classification model can be trained to identify malware; for training data labeled with different categories, a multi-class classification model can be trained to further determine which family and type a file judged to be malware belongs to. In this embodiment, malware is divided into 10 major categories: adware (Adware), backdoor programs (Backdoor), Trojan horses (Trojan), destructive computer viruses (Virus), worms (Worm), ransomware (Ransom), hacking tools (HackTool), rogue software (Rogue), rootkits (Rootkit), and virus tools (Virus Tool).
The machine learning algorithm used in this embodiment is LightGBM, the Light Gradient Boosting Machine. LightGBM is a boosting method that improves on the traditional gradient boosting decision tree algorithm, making it faster to compute, more widely applicable, more accurate, and less demanding on hardware. LightGBM uses a histogram-based decision tree method, which greatly reduces memory consumption and computational cost. Compared with the traditional pre-sorted algorithm, the histogram-based algorithm consumes only 1/8 of the memory. The time complexity of finding a split point in a decision tree is O(n), but when the data is split, unlike in the pre-sorted algorithm, all features share the same index table, so only this one index table needs to be operated on. LightGBM can also greatly reduce communication cost when a computer cluster is used to accelerate training, saving communication time between the parallel computers and greatly speeding up training. This embodiment, however, does not involve training on a parallel computer cluster.
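The histogram idea can be illustrated with a small sketch. This is not LightGBM's implementation; it is a simplified, assumed version for a single feature that uses equal-width bins and a variance-reduction style gain computed from per-sample gradients. The function name and the default bin count are hypothetical.

```python
def best_histogram_split(values, gradients, n_bins=8):
    """Find a split threshold by scanning bin boundaries instead of sorted values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    # One pass over the data builds the histogram of gradient sums and counts.
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)
        grad_sum[b] += g
        count[b] += 1

    total_g, total_n = sum(grad_sum), len(values)
    best_gain, best_bin = 0.0, None
    left_g, left_n = 0.0, 0
    # Only n_bins - 1 candidate splits are evaluated.
    for b in range(n_bins - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_n = total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        right_g = total_g - left_g
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n - total_g ** 2 / total_n
        if gain > best_gain:
            best_gain, best_bin = gain, b

    threshold = lo + (best_bin + 1) * width if best_bin is not None else None
    return threshold, best_gain
```

Because only bin statistics are kept, memory use depends on the number of bins rather than on the number of distinct feature values.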
Step S500: Use a test sample set to test the trained machine learning classifier and evaluate its performance, in order to judge whether the trained model can meet actual requirements.
Specifically, in this embodiment 100,000 malware samples and 100,000 normal software samples are used to test the detection rate and false-positive rate, and a classification test is carried out on the 100,000 malware samples to check the accuracy of the machine learning classifier.
A specific implementation includes: using the test sample set to measure the performance of the trained model with metrics such as accuracy, recall, and the ROC curve/AUC. In addition, hypothesis testing is used to estimate the generalization error from the test error and thereby characterize the model's generalization performance. That is, from the hypothesis test result one can infer, if model A is observed to outperform model B on the test set, how likely it is that A's generalization performance is statistically better than B's. Based on this evaluation, it is judged whether the trained model meets the needs of actual use. If it does, the next step is carried out; if it does not, the process returns to the training stage, and model performance is improved by adjusting the training parameters, increasing the number of iterations, or choosing a different cost function, regularization term, or learning rate.
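The metrics named above can be computed as in the following sketch. It assumes binary labels where 1 means malware and 0 means normal software; the function names are hypothetical, and the AUC is computed by the pairwise-ranking definition rather than by tracing the ROC curve.

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels, 1 = malware, 0 = normal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def recall(y_true, y_pred):
    """Detection rate: fraction of malware samples that were flagged."""
    tp, _, _, fn = confusion(y_true, y_pred)
    return tp / (tp + fn)


def false_positive_rate(y_true, y_pred):
    """Fraction of normal samples wrongly flagged as malware."""
    _, fp, tn, _ = confusion(y_true, y_pred)
    return fp / (fp + tn)


def auc(y_true, scores):
    """Probability that a random malware sample scores above a random normal one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```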
Step S600: Package the tested model to output a machine learning classification model suitable for processing by subsequent systems.
In one optional embodiment of the invention, the machine learning classification model is packaged in an intuitive, human-readable JSON format that contains the model generation date, model type, feature names, feature value ranges, learning rate, number of subtrees, basic information about each subtree, the feature importance ranking, and so on.
In another optional embodiment of the invention, the machine learning classification model is packaged in a binary format. The content is the same as above, but binary packaging greatly speeds up model loading and, for models that generate a very large number of decision trees, effectively reduces reading and parsing time.
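A JSON package of this kind might look like the following sketch. The key names and the function name are hypothetical; the text lists what the package contains but does not define a schema.

```python
import json
from datetime import date


def package_model_json(model_type, feature_names, feature_ranges,
                       learning_rate, trees, importance):
    """Serialize model metadata and trees to a readable JSON string.

    All key names below are illustrative assumptions, not a defined format.
    """
    payload = {
        "generated": date.today().isoformat(),
        "model_type": model_type,
        "feature_names": feature_names,
        "feature_ranges": feature_ranges,
        "learning_rate": learning_rate,
        "num_trees": len(trees),
        "trees": trees,
        # Most important feature first.
        "feature_importance": sorted(
            importance.items(), key=lambda kv: kv[1], reverse=True
        ),
    }
    return json.dumps(payload, indent=2)
```

The binary variant would carry the same fields in a compact encoding, trading readability for faster loading.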
Step S700: Using the generated machine learning classifier, receive externally supplied feature data of a software file under test and judge whether the file is malware. If it is, use the family classification model to determine which type of malware it belongs to, and issue a malware warning in real time. The warning method can be selected by the user and includes, but is not limited to, logs, e-mail, pop-up windows, and SMS messages.
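The two-stage decision in step S700 can be sketched as follows. The classifiers and the alert channel are passed in as plain callables, which is an assumption made to keep the example self-contained; the names `detect`, `is_malware`, `family_of`, and `alert` are hypothetical.

```python
from typing import Callable, Optional, Sequence


def detect(features: Sequence[float],
           is_malware: Callable[[Sequence[float]], bool],
           family_of: Callable[[Sequence[float]], str],
           alert: Callable[[str], None]) -> Optional[str]:
    """Run the binary model first; only malware goes to the family model."""
    if not is_malware(features):
        return None
    family = family_of(features)
    alert(f"malware detected: {family}")  # channel chosen by the user
    return family
```

Usage with stand-in models: `detect([0.2, 0.9], lambda f: f[1] > 0.5, lambda f: "Trojan", print)` prints a warning and returns "Trojan".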
The third embodiment of the invention proposes a feature processing method based on mixed non-cryptographic hash algorithms and describes in detail how step S300 of the second embodiment is implemented. Specifically, as shown in Figure 3, the method includes the following steps:
Step S301: Extract the non-numeric feature data as shown in Table 2 above.
Step S302: De-duplicate and de-noise these non-numeric features.
Specifically, because all numeric and non-numeric features have already been cleaned earlier, this step focuses on checking the string sequences for repeated API and DLL strings and on removing API and DLL strings that may be incomplete. In general, API function strings end with .exe and DLL strings end with .dll.
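Step S302 can be sketched as a simple filter. The sketch assumes that a string is "incomplete" when it lacks the expected .exe or .dll suffix and that duplicates are compared case-insensitively; both are assumptions, and the function name is hypothetical.

```python
def clean_name_strings(strings):
    """Drop incomplete and repeated API/DLL strings, keeping first occurrences."""
    seen = set()
    cleaned = []
    for s in strings:
        s = s.strip()
        key = s.lower()
        # Assumed completeness test: the suffix convention stated in the text.
        if not (key.endswith(".dll") or key.endswith(".exe")):
            continue
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(s)
    return cleaned
```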
Step S303: Group the non-numeric features and, for each group, apply the hash methods described in steps S3041, S3042, and S3043 to obtain hash values.
In an exemplary embodiment of the invention, the non-numeric features about the PE extracted in rows 2 to 8 of Table 2 form one group and one string sequence; the non-numeric features about code sections extracted in rows 9 to 15 of Table 2 form one group and one string sequence; and the non-numeric features about the import address table and the export function table extracted in the last two rows, 16 and 17, form one group and one string sequence.
Step S3041: Hash the input non-numeric features using the MurmurHash3 algorithm.
MurmurHash is a non-cryptographic hash algorithm characterized by fast hashing and a low collision rate. The hash value can be 32, 64, or 128 bits long. According to calculation, with 128-bit hash values the probability of a hash collision across millions of data items is almost 0. This embodiment uses, by way of example, the MurmurHash3 algorithm with a 128-bit hash value.
Specifically, the MurmurHash3 algorithm uses a sliding window to take consecutive 2-bit blocks as a group and applies large-integer multiplication, shift operations, XOR operations, first-order linear transformations, cumulative summation, and the like to obtain the final 128-bit hash value.
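The multiply, rotate, and XOR structure of MurmurHash3 can be seen in a compact sketch. Note the substitution: the embodiment uses the 128-bit variant, while the code below implements the shorter 32-bit variant (x86_32) of the public MurmurHash3 algorithm, chosen only to keep the example small. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86_32 (32-bit variant, used here for illustration)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n_blocks = len(data) // 4

    # Body: mix each 4-byte little-endian block into the state.
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: the remaining 1 to 3 bytes.
    tail = data[4 * n_blocks:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k

    # Finalization: fold in the length and force avalanche.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h
```

In practice a maintained library implementation of the 128-bit variant would normally be used instead of hand-written code.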
Step S3042: Hash the input non-numeric features using SimHash.
SimHash is a locality-sensitive hash that preserves the characteristic information of the original data well. Its hash values are highly comparable: the similarity between different hash values can be compared using, for example, the Hamming distance. SimHash is generally used for de-duplicating massive document collections; in this embodiment it is used for feature processing of the extracted strings.
The feature processing method using SimHash proposed in this embodiment is as follows:
(1) Convert the original string into an n-dimensional vector using the 2-gram method, where each dimension of the vector is 2 bytes. For example, the string "MSVCP60.dll" is converted into [MS, SV, VC, CP, P6, 60, 0., .d, dl, ll];
(2) Assign a weight W_i (i = 0, 1, ..., n-1) to each dimension of the n-dimensional vector. If all weights are equal, they can be set to: [formula not reproduced];
(3) Hash each dimension of the n-dimensional vector. The hash method can be chosen freely, whether cryptographic or non-cryptographic, and is determined mainly by the desired number of bits in the mapped hash value. This embodiment uses MD5 as the hash method for this step, producing a 128-bit hash value;
(4) Apply the weight W_i bit by bit to the hash value of each dimension: a bit that is 1 is recorded as W_i, and a bit that is 0 is recorded as -W_i. The n weighted hash values are then summed bit by bit to obtain a 128-dimensional vector in which each dimension is a floating-point number;
(5) In this 128-dimensional vector, a dimension whose value is greater than the threshold σ is recorded as 1, and a dimension whose value is less than or equal to σ is recorded as 0. The 128-dimensional floating-point vector is thus converted into a 128-bit string, which is the final SimHash result.
The threshold σ is calculated as follows:
[formula not reproduced]
where B_ij, which is 1 or 0, is the value of the j-th bit of the hash of the i-th dimension of the vector computed in step (3), and W_i is as defined in step (2).
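Steps (1) to (5) can be sketched as follows. Two assumptions are made because the corresponding formulas are not available in this text: equal weights default to 1.0, and the threshold `sigma` defaults to 0. Both are exposed as parameters. The function names are hypothetical.

```python
import hashlib


def two_grams(s: str) -> list:
    """Step (1): overlapping 2-character windows of the string."""
    return [s[i:i + 2] for i in range(len(s) - 1)]


def simhash128(s: str, weights=None, sigma: float = 0.0) -> int:
    """Steps (2) to (5): weighted bitwise vote over MD5 hashes of the 2-grams."""
    grams = two_grams(s)
    if weights is None:
        weights = [1.0] * len(grams)  # assumed equal weighting
    acc = [0.0] * 128
    for gram, w in zip(grams, weights):
        # Step (3): 128-bit MD5 hash of each dimension.
        h = int.from_bytes(hashlib.md5(gram.encode("utf-8")).digest(), "big")
        # Step (4): add +w for a 1 bit and -w for a 0 bit.
        for j in range(128):
            bit = (h >> (127 - j)) & 1
            acc[j] += w if bit else -w
    # Step (5): threshold each dimension to obtain the final 128-bit value.
    result = 0
    for j in range(128):
        result = (result << 1) | (1 if acc[j] > sigma else 0)
    return result


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two SimHash values."""
    return bin(a ^ b).count("1")
```

Strings that share most of their 2-grams contribute mostly the same votes, so their SimHash values tend to lie at a small Hamming distance from each other.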
Step S3043: Hash the input non-numeric features using CRC32.
CRC32 is a cyclic redundancy check algorithm generally used to verify correctness during data frame transmission. In the present invention it is used to hash a string to a 32-bit value, which is then used for feature processing. The steps by which this embodiment performs feature processing with CRC32 are:
(1) Choose the following generator polynomial:
$C(x) = 1 + x + x^2 + x^4 + x^5 + x^7 + x^8 + x^{10} + x^{11} + x^{12} + x^{16} + x^{22} + x^{23} + x^{26} + x^{32}$, whose hexadecimal representation (in reflected, least-significant-bit-first form) is 0xEDB88320.
(2) Taking the binary form of the original string sequence as the dividend and the above generator polynomial as the divisor, perform modulo-2 division; the resulting 32-bit remainder is the CRC32 hash code.
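The modulo-2 division can be written as a bitwise loop over the reflected polynomial 0xEDB88320. The sketch below follows the common CRC-32 convention (initial value 0xFFFFFFFF and a final XOR with 0xFFFFFFFF), which is an assumption here since the text describes only the division itself; under that convention the result matches `zlib.crc32` from the Python standard library.

```python
import zlib

POLY = 0xEDB88320  # reflected form of the generator polynomial above


def crc32_bitwise(data: bytes) -> int:
    """Bit-by-bit CRC-32 using the common init/final-XOR convention."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; subtract (XOR) the polynomial if it was set.
            crc = (crc >> 1) ^ (POLY if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


sample = b"MSVCP60.dll"
assert crc32_bitwise(sample) == zlib.crc32(sample)
```

In practice `zlib.crc32` is the simpler choice; the loop is shown only to make the division step concrete.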
Step S305: Mix, splice, and recombine the results obtained with the three hash algorithms above to form new feature vectors and a new feature matrix. In this embodiment the hash value of each group is 128 + 128 + 32 bits long, and the actual storage format is the byte type.
Figure 4 shows the mixed splicing and recombination method proposed by the invention:
For the hash value of each group, the first 128 + 128 bits are split into bytes and each byte is converted into an integer, while the last 32 bits are converted as a whole into a long integer. Each group thus forms 33 integer features, and the three groups are spliced in order to form a feature vector of 99 features in total.
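The arithmetic behind the 33 and 99 feature counts is 16 + 16 + 1 per group, times three groups. A minimal sketch, assuming the two 128-bit hashes arrive as 16-byte strings and the CRC32 value as an integer (function names are hypothetical):

```python
def splice_group(murmur128: bytes, simhash128: bytes, crc32_value: int) -> list:
    """One group: 16 byte-integers + 16 byte-integers + 1 long integer = 33."""
    if len(murmur128) != 16 or len(simhash128) != 16:
        raise ValueError("expected two 16-byte (128-bit) hash values")
    return list(murmur128) + list(simhash128) + [crc32_value & 0xFFFFFFFF]


def splice_all(groups) -> list:
    """Concatenate the groups in order; three groups give 99 features."""
    features = []
    for murmur128, simhash128, crc32_value in groups:
        features.extend(splice_group(murmur128, simhash128, crc32_value))
    return features
```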
Step S306: Extract the numeric features described in Table 1, 641 integer features in total. Boolean values are treated as integers taking the value 0 or 1.
Step S307: Standardize all 740 extracted features (that is, 641 + 99) to eliminate the influence of differences in numeric range between features. The formula is:
$x' = \frac{x - E(x)}{\sigma}$
where E(x) is the mean of the feature and σ is its standard deviation.
The standardized feature data is then normalized, mapping each feature value in every row into the interval [0, 1].
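Standardization followed by normalization can be sketched as below. The sketch assumes z-score standardization per feature column and min-max scaling for the mapping to [0, 1]; the text does not state which normalization is used, so min-max is an assumption, as are the function names.

```python
import statistics


def standardize(column):
    """Subtract the column mean and divide by the standard deviation."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    if sd == 0:
        return [0.0] * len(column)  # constant feature carries no information
    return [(x - mean) / sd for x in column]


def min_max(column):
    """Map a column linearly onto [0, 1] (assumed normalization)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(x - lo) / (hi - lo) for x in column]


def scale_matrix(rows):
    """Apply both steps column by column to an M x d list of rows."""
    columns = [list(c) for c in zip(*rows)]
    scaled = [min_max(standardize(c)) for c in columns]
    return [list(r) for r in zip(*scaled)]
```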
Step S308: The processed data forms a feature matrix of dimension M × 740, which is input into the machine learning classifier for training, where M is the number of samples.
Optionally, in this embodiment, the processed data forms a feature matrix of dimension M × 740; dimensionality reduction methods such as the Pearson correlation coefficient and the chi-square test are applied to the feature matrix to remove feature columns whose correlation is clearly weak, and the result is then input into the machine learning classifier for training.
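Filtering by the Pearson correlation coefficient can be sketched as follows. The sketch assumes that the correlation is measured between each feature column and the label vector; the cutoff `min_abs_corr` and the function names are hypothetical, since the text does not give a threshold.

```python
import math


def pearson(x, y) -> float:
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_by_pearson(columns, labels, min_abs_corr=0.1):
    """Indices of feature columns whose |correlation| with the labels is kept."""
    return [i for i, col in enumerate(columns)
            if abs(pearson(col, labels)) >= min_abs_corr]
```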
Corresponding to the method of the first embodiment of the invention, the fourth embodiment provides a software detection apparatus which, as shown in Figure 5, specifically includes:
a feature extraction module 510, configured to extract the numeric features and non-numeric features contained in each sample in a software sample library;
a feature processing module 520, configured to process the non-numeric features using N selected non-cryptographic hash algorithms and convert the processing results into numeric features, where N is an integer greater than 1;
a matrix construction module 530, configured to construct a feature matrix from the numeric features contained in each sample and the numeric features obtained by conversion;
a learning and training module 540, configured to train a machine learning classifier using the feature matrix; and
a detection module 550, configured to detect target software using the machine learning classifier.
Optionally, in this embodiment, the learning and training module 540 is further configured to test the trained machine learning classifier using a test sample set, in order to adjust the model parameters of the machine learning classifier.
Optionally, in this embodiment, the learning and training module 540 is specifically configured to train a first machine learning classifier, using the feature matrix constructed from software samples labeled as malware or normal software, to classify software as malware or normal software; and to train a second machine learning classifier, using the feature matrix constructed from malware samples labeled with different types, to classify the type of malware.
Optionally, in this embodiment, the numeric features include one or more of the following: code header file information, code section information, string statistics, overall sample statistics, the import address table function list, the export function list, byte count information, and byte entropy statistics.
Optionally, in this embodiment, the non-numeric features include one or more of the following: the sequence of recognizable strings in the software header information, the sequence of all path strings, the sequence of all uniform resource locator strings, the sequence of all registry key strings, the machine type string in the software header information, the sequence of all section name strings of the software, the entry point name string, and the sequence of strings formed by Q or more consecutive recognizable characters in all sections of the software, where Q is a positive integer.
Optionally, in this embodiment, the feature processing module 520 is specifically configured to group the non-numeric features according to a set grouping scheme; for each group of non-numeric features, hash the group with each of the N non-cryptographic hash algorithms to obtain N hash values, and convert the N hash values into integers; and splice the integer features of all groups to obtain the converted numeric features.
Optionally, in this embodiment, the matrix construction module 530 is specifically configured to standardize each numeric feature, normalize the standardized feature data, and construct the feature matrix from the normalized feature data.
Optionally, in this embodiment, the matrix construction module 530 is further configured to, after the feature matrix is constructed, perform dimensionality reduction on the feature matrix according to a set dimensionality reduction method.
Optionally, in this embodiment, the N non-cryptographic hash algorithms include at least two of the following: the MurmurHash3 algorithm, the SimHash algorithm, and the CRC32 algorithm.
For the specific implementation of each of the above modules, see the first and second embodiments; it is not repeated here.
In summary, the software detection scheme proposed by this embodiment is based on mixed non-cryptographic hash features and a machine learning model. It converts the complex string features extracted from malware samples into hash features that machine learning algorithms can process easily, which reduces the difficulty of model training, significantly increases training speed, reduces space overhead, and improves the accuracy of malware identification.
The fifth embodiment of the invention provides a software detection apparatus which, as shown in Figure 6, specifically includes:
a feature extraction module 610, configured to extract the numeric features and non-numeric features contained in each sample in a software sample library;
a feature processing module 620, configured to process the non-numeric features using N selected non-cryptographic hash algorithms and convert the processing results into numeric features, where N is an integer greater than 1;
a matrix construction module 630, configured to construct a feature matrix from the numeric features contained in each sample and the numeric features obtained by conversion;
a learning and training module 640, configured to train a machine learning classifier using the feature matrix; optionally, this module can be set up as an offline module that packages the model after offline training and passes it to the detection module 670;
a file format discrimination module 650, configured to detect whether the input target software is in a software format supported by the apparatus and, if so, to trigger the feature extraction module 610, which receives the transmitted file of the target software, extracts the numeric features and/or non-numeric features contained in the software, and inputs them to the file pre-scan module 660;
a file pre-scan module 660, configured to search for matching signatures according to an existing malware signature database and rule base, in order to screen for malware.
Optionally, this module uses traditional signature matching and YARA rule matching. If malware is detected by signature or rule matching, an alarm is sent directly to the result recording and online alarm module 680. Otherwise, the feature processing module 620 is triggered: it processes the non-numeric features extracted from the target software using the N selected non-cryptographic hash algorithms and converts the processing results into numeric features. The matrix construction module 630 then constructs a feature matrix from the numeric features extracted from the target software and the numeric features obtained by conversion, and inputs the feature matrix to the detection module 670;
a detection module 670, configured to detect the target software using the machine learning classifier. Specifically, this module is set up as an online module: using the generated detection and classification models, it receives externally supplied feature data of the file under test and judges whether the file is malware; if so, it uses the family classification model to determine which type of malware the file belongs to;
a result recording and online alarm module 680, configured to monitor malware detection results online in real time and issue malware warnings in real time. The warning method can be selected by the user and includes, but is not limited to, logs, e-mail, pop-up windows, and SMS messages.
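The gating order of modules 650, 660, and 670 can be sketched as a short pipeline. Everything concrete here is an assumption made for illustration: the supported format is taken to be files starting with the "MZ" magic bytes, signatures are plain byte patterns rather than YARA rules, and the classifier is a stand-in callable that receives the raw bytes instead of a feature matrix.

```python
def scan(data: bytes, signatures: dict, classify) -> str:
    """Format check, then signature pre-scan, then the ML classifier."""
    # Module 650 (assumed check): only files with the "MZ" magic are supported.
    if not data.startswith(b"MZ"):
        return "unsupported"
    # Module 660: a signature hit short-circuits the pipeline.
    for name, pattern in signatures.items():
        if pattern in data:
            return f"malware:{name}"
    # Module 670: the classifier is consulted only when the pre-scan is inconclusive.
    return "malware:ml" if classify(data) else "clean"
```

Running the cheaper signature match first means the hashing and classification steps are spent only on files that the existing rules cannot decide.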
In summary, the software detection scheme proposed by this embodiment is based on mixed non-cryptographic hash features and a machine learning model. It converts the complex string features extracted from malware samples into hash features that machine learning algorithms can process easily, which reduces the difficulty of model training, significantly increases training speed, reduces space overhead, and improves the accuracy of malware identification. In addition, the scheme of this embodiment uses the file pre-scan module to pre-judge the software and passes it to the classification model of the invention only when no judgment can be made, which further improves identification efficiency. This embodiment is also provided with an alarm module, which further improves the user experience.
The sixth embodiment of the invention provides a computing device. As shown in Figure 7, the computing device includes a memory 710, a processor 720, and a communication bus 730; the communication bus 730 is used for connection and communication between the processor 720 and the memory 710.
Specifically, in this embodiment the processor 720 can be a general-purpose processor such as a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the invention. The memory 710 is used to store instructions executable by the processor 720.
The memory 710 is used to store program code and to transfer the program code to the processor 720. The memory 710 may include volatile memory, such as random access memory (RAM); it may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); it may also include a combination of the above kinds of memory.
Specifically, in this embodiment the processor 720 is configured to execute a software detection program among the application programs stored in the memory 710 to implement the following method steps:
Step 1: Extract the numeric features and non-numeric features contained in each sample in a software sample library;
Step 2: Process the non-numeric features using N selected non-cryptographic hash algorithms, and convert the processing results into numeric features, where N is an integer greater than 1;
Step 3: Construct a feature matrix from the numeric features contained in each sample and the numeric features obtained by conversion;
Step 4: Train a machine learning classifier using the feature matrix;
Step 5: Detect target software using the machine learning classifier.
For the implementation of each step in this embodiment, see the first to third embodiments; it is not repeated here.
The seventh embodiment of the invention provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the following method steps:
Step 1: Extract the numeric features and non-numeric features contained in each sample in a software sample library;
Step 2: Process the non-numeric features using N selected non-cryptographic hash algorithms, and convert the processing results into numeric features, where N is an integer greater than 1;
Step 3: Construct a feature matrix from the numeric features contained in each sample and the numeric features obtained by conversion;
Step 4: Train a machine learning classifier using the feature matrix;
Step 5: Detect target software using the machine learning classifier.
For the implementation of each step in this embodiment, see the first to third embodiments; it is not repeated here.
The computer storage medium can be RAM, flash memory, ROM, EPROM, EEPROM, a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art.
In the embodiments provided in this application, it should be understood that the disclosed apparatus and method may also be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the drawings show the possible architectures, functions, and operations of apparatuses, methods, and computer program products according to multiple embodiments of the invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
In addition, the functional modules in the embodiments of the invention may be integrated to form a single independent part, each module may exist separately, or two or more modules may be integrated to form an independent part.
In short, the above are merely preferred embodiments of the invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.
Claims (20)
1. A software detection method, characterized by comprising:
extracting the numeric features and non-numeric features contained in each sample in a software sample library;
processing the non-numeric features using N selected non-cryptographic hash algorithms, and converting the processing results into numeric features, where N is an integer greater than 1;
constructing a feature matrix from the numeric features contained in each sample and the numeric features obtained by conversion;
training a machine learning classifier using the feature matrix; and
detecting target software using the machine learning classifier.
2. The method of claim 1, characterized in that, before the target software is detected using the machine learning classifier, the method further comprises:
testing the trained machine learning classifier using a test sample set, in order to adjust the model parameters of the machine learning classifier.
3. The method of claim 1, characterized in that training the machine learning classifier using the feature matrix specifically comprises:
training a first machine learning classifier, using the feature matrix constructed from software samples labeled as malware or normal software, to classify software as malware or normal software; and
training a second machine learning classifier, using the feature matrix constructed from malware samples labeled with different types, to classify the type of malware.
4. the method as described in claim 1, which is characterized in that the numeric type feature includes one or more in following feature
It is a: code head file information, code segment information, character string statistical information, sample general evaluation system information, function in imported address list
List, export function list, byte count information and byte information entropy statistics.
5. The method of claim 1, characterized in that the non-numeric features comprise one or more of the following: recognizable string sequences in the software header information, all path string sequences, all uniform resource locator (URL) string sequences, all registry key string sequences, the machine type string in the software header information, all section name string sequences of the software, the entry point name string, and string sequences composed of Q or more consecutive recognizable characters in all sections of the software;
where Q is a positive integer.
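As an illustration of the last feature in claim 5, the sketch below extracts runs of Q or more consecutive recognizable characters from raw bytes and sorts them into coarse groups. Treating "recognizable" as printable ASCII (0x20 to 0x7E), and recognizing URLs, registry keys, and paths by the simple prefixes shown, are assumptions made for this example; `blob` is fabricated sample data.

```python
import re
from typing import Dict, List


def printable_strings(data: bytes, q: int = 4) -> List[str]:
    """Return every run of at least q consecutive printable ASCII bytes."""
    if q < 1:
        raise ValueError("q must be a positive integer")
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % q)
    return [match.decode("ascii") for match in pattern.findall(data)]


def categorize(strings: List[str]) -> Dict[str, List[str]]:
    """Group strings by a few simple, assumed prefix heuristics."""
    groups: Dict[str, List[str]] = {"url": [], "registry": [], "path": [], "other": []}
    for text in strings:
        if re.match(r"https?://", text):
            groups["url"].append(text)
        elif text.startswith("HKEY_"):
            groups["registry"].append(text)
        elif re.match(r"[A-Za-z]:\\", text):
            groups["path"].append(text)
        else:
            groups["other"].append(text)
    return groups


# Fabricated bytes standing in for the contents of a software section.
blob = (
    b"\x00\x01http://example.com/a\x00C:\\Windows\\a.dll"
    b"\x00ab\x00HKEY_LOCAL_MACHINE\\Software\x00"
)
groups = categorize(printable_strings(blob, q=4))
```

With Q = 4 the two-character run `ab` is discarded, and the remaining three strings fall into the URL, path, and registry groups respectively.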
6. The method of claim 1, characterized in that processing the non-numeric features using the N selected non-cryptographic hash algorithms and converting the processing results into numeric features specifically comprises:
grouping the non-numeric features according to a set grouping scheme;
for each group of non-numeric features, hashing the group separately with each of the N non-cryptographic hash algorithms to obtain N hash values, and converting the N hash values into integers; and
concatenating the integer features of all groups to obtain the converted numeric features.
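The three steps of claim 6 (group, hash each group N ways, concatenate) can be sketched as follows. This is an illustration under stated assumptions: N = 2, with `zlib.crc32` and `zlib.adler32` from the Python standard library standing in for whichever non-cryptographic hashes are selected; grouping by feature kind and the names `GROUP_ORDER`, `hash_group`, and `to_numeric` are hypothetical.

```python
import zlib
from typing import Callable, Dict, List, Sequence

# Assumed selection of N = 2 non-cryptographic hash functions.
HASHES: List[Callable[[bytes], int]] = [zlib.crc32, zlib.adler32]

# Assumed grouping scheme: one group per kind of string feature,
# visited in a fixed order so every sample yields the same layout.
GROUP_ORDER = ("url", "path", "registry")


def hash_group(strings: Sequence[str]) -> List[int]:
    """Hash one group with every selected algorithm, giving N integers."""
    payload = "\n".join(strings).encode("utf-8")
    return [hash_fn(payload) & 0xFFFFFFFF for hash_fn in HASHES]


def to_numeric(groups: Dict[str, Sequence[str]]) -> List[int]:
    """Concatenate the N integers of each group into one feature vector."""
    features: List[int] = []
    for name in GROUP_ORDER:
        features.extend(hash_group(groups.get(name, [])))
    return features


sample_groups = {
    "url": ["http://example.com/a"],
    "path": ["C:\\Windows\\a.dll"],
    "registry": [],
}
vector = to_numeric(sample_groups)  # 3 groups x 2 hashes = 6 integers
```

Because the group order and the hash list are fixed, every sample produces a vector of the same length, which is what allows the converted features to be stacked into a matrix.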
7. The method of claim 1, characterized in that constructing the feature matrix from the numeric features contained in each sample and the numeric features obtained by the conversion specifically comprises:
standardizing each numeric feature;
normalizing the standardized feature data; and
constructing the feature matrix from the normalized feature data.
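The sequence in claim 7 (standardize, then normalize, then assemble the matrix) can be sketched as below. The claim does not fix the formulas, so z-score standardization and min-max normalization to the range 0 to 1 are assumed here, applied per feature column; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def standardize(column: Sequence[float]) -> List[float]:
    """Z-score a column: subtract the mean, divide by the standard deviation."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(variance)
    return [0.0 if std == 0 else (v - mean) / std for v in column]


def normalize(column: Sequence[float]) -> List[float]:
    """Min-max scale a column into the range [0, 1]."""
    low, high = min(column), max(column)
    return [0.0 if high == low else (v - low) / (high - low) for v in column]


def build_feature_matrix(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale every feature column, then reassemble one row per sample."""
    columns = [list(col) for col in zip(*rows)]
    scaled = [normalize(standardize(col)) for col in columns]
    return [list(row) for row in zip(*scaled)]


# Three hypothetical samples with two numeric features each.
matrix = build_feature_matrix([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
```

Constant columns, such as the second feature here, are mapped to 0.0 rather than dividing by zero.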
8. The method of claim 1, characterized in that, after the feature matrix is constructed, the method further comprises: performing dimensionality reduction on the feature matrix according to a set dimensionality reduction method.
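Claim 8 leaves the dimensionality reduction method open. As one simple possibility, and not necessarily the method contemplated by the specification, the sketch below keeps only the k feature columns with the highest variance; the function names and the sample matrix are hypothetical.

```python
from typing import List, Sequence, Tuple


def column_variances(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Population variance of each column of the feature matrix."""
    n = len(matrix)
    variances = []
    for column in zip(*matrix):
        mean = sum(column) / n
        variances.append(sum((v - mean) ** 2 for v in column) / n)
    return variances


def reduce_dimensions(
    matrix: Sequence[Sequence[float]], k: int
) -> Tuple[List[List[float]], List[int]]:
    """Keep the k highest-variance columns, preserving their original order.

    Returns the reduced matrix and the indices of the retained columns.
    """
    variances = column_variances(matrix)
    ranked = sorted(range(len(variances)), key=lambda i: variances[i], reverse=True)
    keep = sorted(ranked[:k])
    return [[row[i] for i in keep] for row in matrix], keep


features = [[0.0, 0.5, 1.0], [1.0, 0.5, 0.9], [0.0, 0.5, 0.0]]
reduced, kept_columns = reduce_dimensions(features, k=2)
```

The retained column indices must be stored alongside the trained classifier so that the same columns are selected when target software is later detected.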
9. The method of any one of claims 1 to 8, characterized in that the N non-cryptographic hash algorithms comprise at least two of the following algorithms: the MurMurHash3 algorithm, the SimHash algorithm, and the CRC32 algorithm.
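For reference, the three algorithms named in claim 9 can be sketched in pure Python as follows. The 32-bit variant of MurmurHash3 is assumed, SimHash is shown in its common token-voting form with MurmurHash3 as the per-token hash, and CRC32 is taken from the standard `zlib` module; the helper `hash_features` and the choice of 32-bit widths are assumptions of this example rather than details from the specification.

```python
import zlib
from typing import List, Sequence

MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 (x86, 32-bit variant)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n_blocks = len(data) // 4
    for i in range(n_blocks):  # body: one little-endian 32-bit block at a time
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        k = _rotl32((k * c1) & MASK32, 15)
        k = (k * c2) & MASK32
        h = _rotl32(h ^ k, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    tail = data[4 * n_blocks:]  # up to 3 leftover bytes
    k = 0
    if len(tail) >= 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if len(tail) >= 1:
        k ^= tail[0]
        k = _rotl32((k * c1) & MASK32, 15)
        h ^= (k * c2) & MASK32
    h ^= len(data)  # finalization mix
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def simhash32(tokens: Sequence[str]) -> int:
    """SimHash: each token votes +1 or -1 on every bit of the fingerprint."""
    votes = [0] * 32
    for token in tokens:
        token_hash = murmur3_32(token.encode("utf-8"))
        for bit in range(32):
            votes[bit] += 1 if (token_hash >> bit) & 1 else -1
    return sum(1 << bit for bit in range(32) if votes[bit] > 0)


def hash_features(tokens: Sequence[str]) -> List[int]:
    """Apply all three hashes (N = 3) to one group of string tokens."""
    joined = " ".join(tokens).encode("utf-8")
    return [murmur3_32(joined), simhash32(tokens), zlib.crc32(joined) & MASK32]
```

MurmurHash3 and CRC32 change unpredictably when the input changes, whereas SimHash is locality-sensitive, so token lists that share most tokens tend to receive fingerprints that differ in only a few bits. Combining the two kinds of hash gives the classifier both exact-match and similarity signals.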
10. A software detection apparatus, characterized by comprising:
a feature extraction module, configured to extract the numeric features and non-numeric features contained in each sample in a software sample library;
a feature processing module, configured to process the non-numeric features using N selected non-cryptographic hash algorithms and to convert the processing results into numeric features, where N is an integer greater than 1;
a matrix construction module, configured to construct a feature matrix from the numeric features contained in each sample and the numeric features obtained by the conversion;
a learning and training module, configured to train a machine learning classifier using the feature matrix; and
a detection module, configured to detect target software using the machine learning classifier.
11. The apparatus of claim 10, characterized in that the learning and training module is further configured to test the trained machine learning classifier using a test sample set, so as to adjust the model parameters of the machine learning classifier.
12. The apparatus of claim 10, characterized in that the learning and training module is specifically configured to: train a first machine learning classifier using a feature matrix constructed from software samples labeled as malware or normal software, so as to classify software as malware or normal software; and train a second machine learning classifier using a feature matrix constructed from malware samples labeled with different malware types, so as to classify the type of the malware.
13. The apparatus of claim 10, characterized in that the numeric features comprise one or more of the following: code header file information, code section information, string statistics, overall sample statistics, the list of functions in the import address table, the list of exported functions, byte count information, and byte information-entropy statistics.
14. The apparatus of claim 10, characterized in that the non-numeric features comprise one or more of the following: recognizable string sequences in the software header information, all path string sequences, all uniform resource locator (URL) string sequences, all registry key string sequences, the machine type string in the software header information, all section name string sequences of the software, the entry point name string, and string sequences composed of Q or more consecutive recognizable characters in all sections of the software; where Q is a positive integer.
15. The apparatus of claim 10, characterized in that the feature processing module is specifically configured to: group the non-numeric features according to a set grouping scheme; for each group of non-numeric features, hash the group separately with each of the N non-cryptographic hash algorithms to obtain N hash values, and convert the N hash values into integers; and concatenate the integer features of all groups to obtain the converted numeric features.
16. The apparatus of claim 10, characterized in that the matrix construction module is specifically configured to: standardize each numeric feature; normalize the standardized feature data; and construct the feature matrix from the normalized feature data.
17. The apparatus of claim 10, characterized in that the matrix construction module is further configured to, after the feature matrix is constructed, perform dimensionality reduction on the feature matrix according to a set dimensionality reduction method.
18. The apparatus of any one of claims 10 to 17, characterized in that the N non-cryptographic hash algorithms comprise at least two of the following algorithms: the MurMurHash3 algorithm, the SimHash algorithm, and the CRC32 algorithm.
19. A computing device, characterized in that the computing device comprises a memory, a processor, and a communication bus; the communication bus is configured to implement connection and communication between the processor and the memory;
the processor is configured to execute a software detection program stored in the memory, so as to implement the steps of the software detection method of any one of claims 1 to 9.
20. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the program, when executed by a processor, implements the steps of the software detection method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811257390.5A CN109359439B (en) | 2018-10-26 | 2018-10-26 | software detection method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109359439A (en) | 2019-02-19 |
CN109359439B (en) | 2019-12-13 |
Family ID: 65346949
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201811257390.5A | Active | CN109359439B (en) | 2018-10-26 | 2018-10-26 | Software detection method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109359439B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007117582A2 (en) * | 2006-04-06 | 2007-10-18 | Smobile Systems Inc. | Malware detection system and method for mobile platforms |
US20120260343A1 (en) * | 2006-09-19 | 2012-10-11 | Microsoft Corporation | Automated malware signature generation |
CN104376262A (en) * | 2014-12-08 | 2015-02-25 | 中国科学院深圳先进技术研究院 | Android malware detecting method based on Dalvik command and authority combination |
CN106778266A (en) * | 2016-11-24 | 2017-05-31 | 天津大学 | A kind of Android Malware dynamic testing method based on machine learning |
CN108595955A (en) * | 2018-04-25 | 2018-09-28 | 东北大学 | A kind of Android mobile phone malicious application detecting system and method |
CN108614970A (en) * | 2018-04-03 | 2018-10-02 | 腾讯科技(深圳)有限公司 | Detection method, model training method, device and the equipment of Virus |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109992969A (en) * | 2019-03-25 | 2019-07-09 | 腾讯科技(深圳)有限公司 | A kind of malicious file detection method, device and detection platform |
CN109992969B (en) * | 2019-03-25 | 2023-03-21 | 腾讯科技(深圳)有限公司 | Malicious file detection method and device and detection platform |
CN110210224A (en) * | 2019-05-21 | 2019-09-06 | 暨南大学 | A kind of mobile software similitude intelligent detecting method of big data based on description entropy |
CN110210224B (en) * | 2019-05-21 | 2023-01-31 | 暨南大学 | Intelligent big data mobile software similarity detection method based on description entropy |
CN112100453A (en) * | 2019-06-18 | 2020-12-18 | 深信服科技股份有限公司 | Method, system, equipment and computer storage medium for character string distribution statistics |
CN111143670A (en) * | 2019-12-09 | 2020-05-12 | 中国平安财产保险股份有限公司 | Information determination method and related product |
CN111144459B (en) * | 2019-12-16 | 2022-12-16 | 重庆邮电大学 | Unbalanced-class network traffic classification method and device and computer equipment |
CN111144459A (en) * | 2019-12-16 | 2020-05-12 | 重庆邮电大学 | Class-unbalanced network traffic classification method and device and computer equipment |
CN111079164A (en) * | 2019-12-18 | 2020-04-28 | 深圳前海微众银行股份有限公司 | Feature correlation calculation method, device, equipment and computer-readable storage medium |
CN111352834A (en) * | 2020-02-25 | 2020-06-30 | 江苏大学 | Self-adaptive random test method based on locality sensitive hashing |
CN111581640A (en) * | 2020-04-02 | 2020-08-25 | 北京兰云科技有限公司 | Malicious software detection method, device and equipment and storage medium |
CN112380537A (en) * | 2020-11-30 | 2021-02-19 | 北京天融信网络安全技术有限公司 | Method, device, storage medium and electronic equipment for detecting malicious software |
CN112883375A (en) * | 2021-02-03 | 2021-06-01 | 深信服科技股份有限公司 | Malicious file identification method, device, equipment and storage medium |
CN113254935A (en) * | 2021-07-02 | 2021-08-13 | 北京微步在线科技有限公司 | Malicious file identification method and device and storage medium |
CN113569241A (en) * | 2021-07-28 | 2021-10-29 | 新华三技术有限公司 | Virus detection method and device |
CN114115730A (en) * | 2021-11-02 | 2022-03-01 | 北京银盾泰安网络科技有限公司 | Application container storage engine platform |
CN114115730B (en) * | 2021-11-02 | 2023-06-13 | 北京银盾泰安网络科技有限公司 | Application container storage engine platform |
CN115221857A (en) * | 2022-09-21 | 2022-10-21 | 中国电子信息产业集团有限公司 | Data similarity detection method and device containing numerical value types |
CN115221857B (en) * | 2022-09-21 | 2023-01-13 | 中国电子信息产业集团有限公司 | Data similarity detection method and device containing numerical value types |
Also Published As
Publication number | Publication date |
---|---|
CN109359439B (en) | 2019-12-13 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN109359439A (en) | Software detecting method, device, equipment and storage medium | |
CN109784056B (en) | Malicious software detection method based on deep learning | |
CN110704840A (en) | Convolutional neural network CNN-based malicious software detection method | |
CN110135157B (en) | Malicious software homology analysis method and system, electronic device and storage medium | |
CN110263538B (en) | Malicious code detection method based on system behavior sequence | |
CN107609399A (en) | Malicious code variant detection method based on NIN neural networks | |
CN109829306A (en) | A kind of Malware classification method optimizing feature extraction | |
CN110363003B (en) | Android virus static detection method based on deep learning | |
CN111915437A (en) | RNN-based anti-money laundering model training method, device, equipment and medium | |
Chaganti et al. | Image-based malware representation approach with EfficientNet convolutional neural networks for effective malware classification | |
CN111753290A (en) | Software type detection method and related equipment | |
CN110909348A (en) | Internal threat detection method and device | |
Jin et al. | A malware detection approach using malware images and autoencoders | |
CN111400713B (en) | Malicious software population classification method based on operation code adjacency graph characteristics | |
Rahman et al. | Interpreting Machine and Deep Learning Models for PDF Malware Detection using XAI and SHAP Framework | |
Nahhas et al. | Android Malware Detection Using ResNet-50 Stacking. | |
CN112000954B (en) | Malicious software detection method based on feature sequence mining and simplification | |
CN115545091A (en) | Integrated learner-based malicious program API (application program interface) calling sequence detection method | |
Waghmare et al. | A review on malware detection methods | |
CN115842645A (en) | UMAP-RF-based network attack traffic detection method and device and readable storage medium | |
Dai et al. | Anticoncept drift method for malware detector based on generative adversarial network | |
CN113821840A (en) | Bagging-based hardware Trojan detection method, medium and computer | |
CN113609290A (en) | Address recognition method and device and storage medium | |
CN111581640A (en) | Malicious software detection method, device and equipment and storage medium | |
Jiang et al. | A pyramid stripe pooling-based convolutional neural network for malware detection and classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |