CN109670306A

CN109670306A - Electric power malicious code detecting method, server and system based on artificial intelligence

Info

Publication number: CN109670306A
Application number: CN201811427170.2A
Authority: CN
Inventors: 高强; 袁宝; 刘宗杰; 马志腾; 乔亚男; 李辉; 陈伦; 张翠珍; 冯庆云; 杨涛; 丛超
Original assignee: State Grid Corp of China SGCC; Jining Power Supply Co of State Grid Shandong Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; Jining Power Supply Co of State Grid Shandong Electric Power Co Ltd
Priority date: 2018-11-27
Filing date: 2018-11-27
Publication date: 2019-04-23

Abstract

The invention discloses an artificial-intelligence-based method, server, and system for detecting electric power malicious code. The detection method comprises: traversing the entire electric power system call sequence to obtain the number of times each API is called and the call relations between APIs, which together form the call-relation frequency feature vector of the corresponding training sample; converting each system call in the electric power system call sequence into a 1-hot vector, feeding the vectors into a recurrent neural network with an LSTM model, and converting the hidden state of the recurrent neural network into the fixed-length feature vector of the corresponding training sample; and combining the call-relation frequency feature vector and the fixed-length feature vector of the training sample and feeding them into a trained decision tree classifier for classification, thereby detecting electric power malicious code. The method achieves high detection efficiency.

Description

Electric power malicious code detecting method, server and system based on artificial intelligence

Technical field

The invention belongs to the field of information security detection, and in particular relates to an artificial-intelligence-based electric power malicious code detection method, server, and system.

Background technique

Informatization is a defining feature of modern society, and its main problem is information security, including computer viruses, hacker attacks, phishing websites, and APTs (advanced persistent threats), all of which can seriously threaten personal privacy, business development, and national security. With the development of computer and network technology, and in particular the deep integration of informatization and industrialization and the rapid growth of the Internet of Things, industrial control systems face threats from advanced viruses and Trojan attacks, and information security problems are becoming increasingly prominent.

In 2010, the Bushehr nuclear power plant in Iran was attacked by the Stuxnet virus. Between 2009 and 2010, the virus damaged more than 1,000 uranium-enrichment centrifuges and ultimately caused the plant to shut down. In recent years, many security attacks on industrial control systems have occurred in China and abroad, with serious consequences. In December 2015, an attack on the power system in Ukraine caused large-scale power outages: the attackers used e-mail with an Excel attachment carrying malicious code to penetrate the systems of staff at a network operating station, implanted the BlackEnergy malware into the power grid network, and gained remote access to and control over the power generation system. In July 2016, the SDG malicious code attack occurred, in which the malware specifically targeted energy companies in Europe.

Industrial control systems are important national infrastructure, spanning many sectors of the national economy, including nuclear facilities, steel, non-ferrous metals, chemicals, petroleum and petrochemicals, electric power, natural gas, advanced manufacturing, key water-control projects, environmental protection, railways, urban rail transit, civil aviation, and urban water, gas, and heat supply. Frequent security incidents have cast a shadow over the normal, stable operation of industrial control systems. Such incidents often affect facilities closely tied to the national economy and people's livelihoods and are highly destructive, placing production safety and public safety under serious threat; their consequences should not be underestimated. Industrial control systems have become an important component of national critical infrastructure, and their security bears on national strategic security.

The security of traditional industrial control systems relied on the obscurity of their technology. Industrial control systems were originally isolated from enterprise management systems, and almost no security measures were taken. In recent years, however, to enable real-time data acquisition and production control, and to meet the needs of the "integration of informatization and industrialization" and the convenience of management, industrial control systems and enterprise management systems have been allowed to communicate directly through logical isolation. Since enterprise management systems are generally connected directly to the Internet, the access scope of an industrial control system now extends to the enterprise network and is also exposed to threats from the Internet.

Because of the importance of, and the large interests at stake in, the security of power industry control systems, attacks against them are increasing, and hackers frequently attack by developing malicious code tailored to the power sector. Detecting and defending against malicious code aimed at the power industry has therefore become an important part of power-system information security. New, purpose-built malicious code is appearing faster and faster, and traditional virus detection techniques cannot meet the current need to detect power-specific malicious code. It is therefore necessary to research artificial-intelligence malicious code detection based on big data, so as to discover unknown malware and purpose-built viruses effectively, improve network security protection, and safeguard power network security.

Detecting and preventing malicious code has become an important part of the information security field. Because the volume of malicious code is now very large and new malicious code appears ever faster, traditional detection techniques cannot meet current detection needs owing to limitations in detection speed and efficiency.

Summary of the invention

To overcome the shortcomings of the prior art, the first object of the present invention is to provide an artificial-intelligence-based electric power malicious code detection method that combines manually defined features with hidden features, improving the speed and accuracy of electric power malicious code detection.

An artificial-intelligence-based electric power malicious code detection method of the present invention comprises:

traversing the entire electric power system call sequence to obtain the number of times each API is called and the call relations between APIs as the call-relation frequency feature vector of the corresponding training sample;

converting each system call in the electric power system call sequence into a 1-hot vector, feeding it into a recurrent neural network with an LSTM model, and converting the hidden state of the recurrent neural network into the fixed-length feature vector of the corresponding training sample;

combining the call-relation frequency feature vector and the fixed-length feature vector of the training sample and feeding them into a trained decision tree classifier for classification, thereby detecting electric power malicious code.

Further, before the electric power system call sequence is traversed, the method further comprises constructing the API data dependency graph of the entire electric power system, as follows:

numbering all APIs of the electric power system to obtain the system call sequence;

establishing a directed API data dependency graph of the entire electric power system according to the data-flow dependencies in the system call sequence.

Here, system call APIs represent the interaction between the operating system and applications and are the primary data analyzed in malicious program behavior detection. Because the call relations between APIs reflect a program's behavioral characteristics better than the call counts of individual APIs, the invention uses the API data dependency graph to capture API call relations. However, it counts system call relation frequencies instead of applying graph mining algorithms, which avoids the computational complexity of graph mining.

Further, while the entire electric power system call sequence is traversed, for the system call currently being traversed, related system calls are searched for among the preceding calls in the sequence;

each time a system call is found, its parameters are compared with those of the system call currently being traversed to judge whether the two system calls are related;

the parameters of each system call are compared one by one through their corresponding hash values.

Here, the parameters of each system call are compared one by one not directly but through their hash values. The advantage is that, for parameters with very long content, such as a Buffer, it is not necessary to traverse the entire content of the parameter in every comparison, which makes parameter comparison more efficient.
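The hash-based comparison described above can be illustrated with a minimal Python sketch. The text does not name a hash function, so SHA-256 is an assumption here, and the helper names `param_digest`, `digest_set`, and `calls_related` are hypothetical. Each parameter is hashed once when the call is recorded, so later comparisons touch only fixed-size digests.

```python
import hashlib

def param_digest(value: bytes) -> str:
    """Hash one parameter value; SHA-256 is an assumed choice."""
    return hashlib.sha256(value).hexdigest()

def digest_set(params):
    """Hash every parameter of a call once, when the call is recorded."""
    return {param_digest(p) for p in params}

def calls_related(digests_a, digests_b) -> bool:
    """Treat two calls as related when they share at least one parameter digest."""
    return not digests_a.isdisjoint(digests_b)

big_buffer = b"\x00" * 1_000_000            # a long Buffer-like parameter
write_call = digest_set([b"handle:0x1f4", big_buffer])
copy_call = digest_set([big_buffer])         # same buffer content, hashed once
other_call = digest_set([b"handle:0x2a0"])

print(calls_related(write_call, copy_call))   # True
print(calls_related(copy_call, other_call))   # False
```

Comparing two digests costs the same regardless of the buffer length, which is the efficiency gain the text refers to.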

Further, each system call is converted into a 1-hot vector as follows:

creating a dictionary that maps system calls one-to-one to IDs;

converting each system call into a vector in which the position corresponding to its ID is 1 and all other positions are 0.

Further, when classifying with the trained decision tree classifier, AUC is used to measure the correlation between each feature and the class.

The second object of the present invention is to provide an artificial-intelligence-based electric power malicious code detection server that combines manually defined features with hidden features, improving the speed and accuracy of electric power malicious code detection.

An artificial-intelligence-based electric power malicious code detection server of the present invention comprises:

a call-relation frequency feature vector acquisition module, configured to: traverse the entire electric power system call sequence to obtain the number of times each API is called and the call relations between APIs as the call-relation frequency feature vector of the corresponding training sample;

a fixed-length feature vector acquisition module, configured to: convert each system call in the electric power system call sequence into a 1-hot vector, feed it into a recurrent neural network with an LSTM model, and convert the hidden state of the recurrent neural network into the fixed-length feature vector of the corresponding training sample;

a code classification module, configured to: combine the call-relation frequency feature vector and the fixed-length feature vector of the training sample and feed them into a trained decision tree classifier for classification, thereby detecting electric power malicious code.

Further, the artificial-intelligence-based electric power malicious code detection server further comprises:

an API data dependency graph construction module, configured to: before the electric power system call sequence is traversed, number all APIs of the electric power system to obtain the system call sequence;

Further, in the call-relation frequency feature vector acquisition module, while the entire electric power system call sequence is traversed, related system calls are searched for among the preceding calls in the sequence for the system call currently being traversed;

Further, in the fixed-length feature vector acquisition module, each system call is converted into a 1-hot vector as follows:

creating a dictionary that maps system calls one-to-one to IDs;

converting each system call into a vector in which the position corresponding to its ID is 1 and all other positions are 0;

and/or

in the code classification module, when classifying with the trained decision tree classifier, AUC is used to measure the correlation between each feature and the class.

The third object of the present invention is to provide an artificial-intelligence-based electric power malicious code detection system that combines manually defined features with hidden features, improving the speed and accuracy of electric power malicious code detection.

An artificial-intelligence-based electric power malicious code detection system of the present invention comprises the artificial-intelligence-based electric power malicious code detection server described above.

Compared with the prior art, the beneficial effects of the present invention are as follows:

The present invention obtains the number of times each API is called and the call relations between APIs as the call-relation frequency feature vector of the corresponding training sample. It then converts each system call in the electric power system call sequence into a 1-hot vector, feeds it into a recurrent neural network with an LSTM model, and converts the hidden state of the recurrent neural network into the fixed-length feature vector of the corresponding training sample. By combining the call-relation frequency feature vector and the fixed-length feature vector of the training sample, the invention integrates manually defined features with hidden features and feeds them into a trained decision tree classifier for classification, improving the speed and accuracy of electric power malicious code detection.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for further understanding of the application. The illustrative embodiments of the application and their descriptions are used to explain the application and do not unduly limit it.

Fig. 1 is a flow chart of an embodiment of the artificial-intelligence-based electric power malicious code detection method of the present invention.

Fig. 2 is an API data dependency graph.

Fig. 3 is a schematic diagram of the vulnerability database fusion process.

Fig. 4 is a schematic structural diagram of the artificial-intelligence-based electric power malicious code detection server of the present invention.

Detailed description of the embodiments

It should be noted that the following detailed description is illustrative and intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meanings as commonly understood by a person of ordinary skill in the technical field to which this application belongs.

It should be noted that the terminology used herein is only for describing specific embodiments and is not intended to limit the exemplary embodiments of the application. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. It should further be understood that the terms "comprising" and/or "including", when used in this specification, indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.

Explanation of terms:

API (Application Programming Interface) refers to a set of predefined functions whose purpose is to give applications and developers the ability to access a group of routines based on certain software or hardware, without having to access the source code or understand the details of the internal working mechanism.

LSTM (Long Short-Term Memory) is a long short-term memory network, a kind of recurrent neural network over time, suited to processing and predicting important events separated by relatively long intervals and delays in a time series.

AUC (Area Under Curve) is defined as the area under the ROC curve; this area cannot exceed 1. Because the ROC curve generally lies above the line y = x, the AUC generally takes values between 0.5 and 1. AUC is used as an evaluation criterion because the ROC curve often does not show clearly which classifier performs better, whereas AUC is a single number: the classifier with the larger AUC performs better.
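As an illustration of the AUC definition, the following minimal Python sketch computes AUC through its equivalent rank interpretation: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The function name `auc` and the sample data are hypothetical. To measure the correlation between a single feature and the class, the values of that feature can be passed as `scores`.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]                         # 1 = malicious, 0 = benign
perfect = auc([0.9, 0.8, 0.3, 0.1], labels)   # every positive outranks every negative
mixed = auc([0.9, 0.2, 0.3, 0.1], labels)     # one positive falls below one negative

print(perfect)  # 1.0
print(mixed)    # 0.75
```

The pairwise form is quadratic in the number of samples; it is chosen here only because it makes the definition easy to read.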

Since security vulnerability databases all publish vulnerability information as HTML pages on their respective websites, the samples used in the embodiments of the present invention are obtained by using a crawler to download these pages from the database websites to local storage, parsing the HTML to recover each item of vulnerability information, and adding the vulnerability to a local database. The extraction is based on the logical structure of HTML tags. Because HTML tags are nested, an entire HTML document can be regarded as a tree: if tag B is nested inside tag A, then B is a node in the subtree rooted at A. This is also the basic idea of the Document Object Model (DOM). Once the HTML has been converted into a tree, a node of the tree can be located precisely with the XML Path Language (XPath), giving the position in the tree of the information to be extracted; the text of the information can then be obtained with simple processing. This method still defines one extraction rule per database, but the rules need not be embedded in program code; they are stored outside the program as configuration files. This provides some flexibility: when the publication format of vulnerabilities changes, only the configuration file has to be changed to adapt.
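The rule-driven extraction described above might look like the following minimal Python sketch. The rule file content, field names, and page markup are all hypothetical, and the standard-library `xml.etree.ElementTree` is used only because the sample page is well-formed; it supports a limited XPath subset, and real-world HTML would generally need a tolerant HTML parser instead.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical rule file content: one XPath per field, kept outside the program code.
RULES_JSON = """{"id": ".//h1[@class='vuln-id']", "summary": ".//div[@id='summary']/p"}"""

# Hypothetical, well-formed page as it might be downloaded by the crawler.
PAGE = """<html><body>
<h1 class="vuln-id">EXAMPLE-0001</h1>
<div id="summary"><p>Buffer overflow in an example service.</p></div>
</body></html>"""

def extract_fields(page: str, rules: dict) -> dict:
    """Parse the page into a tree and pull out one text value per configured rule."""
    root = ET.fromstring(page)
    record = {}
    for field, xpath in rules.items():
        node = root.find(xpath)
        record[field] = node.text.strip() if node is not None and node.text else None
    return record

record = extract_fields(PAGE, json.loads(RULES_JSON))
print(record)
```

If a site changes its page layout, only the XPath strings in the rule file need to change; `extract_fields` stays the same.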

Redundancy exists between vulnerability databases. Each vulnerability database contains both data unique to it and data that overlaps with others; part of the data is obtained independently, and a certain proportion comes from other vulnerability databases. Fusing vulnerability databases is therefore not a matter of simply piling data together: in the fused data, different entries must not point to the same actual vulnerability, and every vulnerability record must be unique. However, there is currently no standard for judging whether two vulnerabilities are the same.

Heterogeneity exists between vulnerability databases. The way text-type vulnerability fields are expressed differs considerably between databases, and when a term with the same meaning appears in different databases, it is hard for software to recognize that the occurrences are the same. Data fusion is therefore hard to automate or to carry out in bulk, while completing it entry by entry by hand would involve an enormous workload and unavoidable subjectivity.

Analysis of heterogeneity in text-type fields: because each vulnerability database has its own representation style, there is serious heterogeneity between databases; the same word or content is written differently in different databases. Other fields, such as the vulnerability description, also differ in how they are written.

Analysis of vulnerability reference links: the topology formed by the reference links of all vulnerabilities in the databases is analyzed, and this topology is used to summarize the relations that may exist between vulnerabilities. The main relations are as follows.

1) "Identical": when data is fused through reference links, the ideal case is a one-to-one reference. For example, a vulnerability A in NVD references a vulnerability B in X_Force; B references no vulnerability in NVD other than A, and is referenced by no vulnerability in NVD other than A. This is a one-to-one reference link, and in this case the relation between the two is considered "identical".

2) "Related": in practice, many-to-many references are common. For example, one NVD vulnerability may reference several X_Force vulnerability entries, and those X_Force entries may in turn reference other NVD vulnerability entries. In this situation, data fused through reference links would contain large errors. The two vulnerabilities may or may not be the same vulnerability, so the relation is defined as "related".

3) "Unrelated": any case other than "identical" and "related" is defined as "unrelated".
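The three relations above can be illustrated with a minimal Python sketch. It assumes that reference links have already been extracted as (NVD entry, X_Force entry) pairs regardless of which side holds the link, and that only directly linked pairs can be "identical" or "related"; the function name and the entry identifiers are hypothetical.

```python
from collections import defaultdict

def classify_relations(links):
    """Build the link topology and return a function that classifies one pair."""
    nvd_to_xf = defaultdict(set)
    xf_to_nvd = defaultdict(set)
    for nvd_id, xf_id in links:
        nvd_to_xf[nvd_id].add(xf_id)
        xf_to_nvd[xf_id].add(nvd_id)

    def relation(nvd_id, xf_id):
        if xf_id not in nvd_to_xf.get(nvd_id, ()):
            return "unrelated"                     # no reference link between the two
        one_to_one = len(nvd_to_xf[nvd_id]) == 1 and len(xf_to_nvd[xf_id]) == 1
        return "identical" if one_to_one else "related"

    return relation

links = [("NVD-1", "XF-1"),                        # one-to-one reference
         ("NVD-2", "XF-2"), ("NVD-2", "XF-3"),     # NVD-2 references two entries
         ("NVD-3", "XF-3")]                        # XF-3 is shared by two NVD entries
relation = classify_relations(links)

print(relation("NVD-1", "XF-1"))  # identical
print(relation("NVD-2", "XF-3"))  # related
print(relation("NVD-1", "XF-3"))  # unrelated
```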

Vulnerability threat data is fused through the following six modules:

Vulnerability data collection module: collects vulnerability information from all sources using the collection method described above.

Vulnerability field processing module: Field data cleaning: after the current vulnerability field is obtained, the character string of each vulnerability field is first normalized, including normalizing the written form, handling irregular characters and special characters, and unifying letter case, to provide a regular data source for the next stage. Field tokenization: the field is split on spaces, stop words are removed, and grammatical processing is applied. Field feature extraction: the weight of each word in the field is calculated with the formula T = IF × IDF, where T is the weight of the word in the current field, IF is the term frequency of the word, and IDF is its inverse document frequency. Generation of field retrieval feature values: based on the word weights calculated in the previous step, the character strings with higher retrieval recall are selected as feature values.

Vulnerability reference link processing module: extracts the reference links from the vulnerability data, processes them with the vulnerability field processing module, and builds a reference-link topology relation library. According to the results in the relation library, the relations between vulnerabilities in the two vulnerability databases to be fused are divided into three classes: "identical", "related", and "unrelated".

Vulnerability product type training set processing module: extracts the vulnerabilities in an "identical" relation as the training set, extracts the product types of these vulnerabilities, processes them with the vulnerability field processing module, and obtains the optimal parameters of the training model.

Vulnerability product type test set processing module: extracts the vulnerabilities in a "related" relation as the test set, extracts their product types, processes them with the vulnerability field processing module, and, using the optimal parameters produced from the training set, calculates the correlation between product type feature strings to build a unified vendor and product type library.

Unified database construction module: according to the vulnerability identity decision rule, comprehensively judges how vulnerability entries in the different vulnerability databases are related, eliminates the redundancy between databases, merges the data of each database, and builds a unified database. Specifically, the decision rule first extracts the vulnerabilities in an "identical" relation and merges them directly; it then extracts the vulnerabilities in a "related" relation and decides whether to merge them according to whether their vendor and product type are the same.
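The word-weighting step of the vulnerability field processing module, T = IF × IDF, can be sketched in Python as follows. The text does not give the exact form of either factor, so this sketch assumes the common choices: term frequency as count divided by field length, and inverse document frequency as log(N / df). The stop-word list, function names, and sample fields are illustrative only.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "in", "of", "to"}   # illustrative list only

def tokenize(field: str):
    """Unify letter case, split on spaces, and drop stop words."""
    return [w for w in field.lower().split() if w not in STOP_WORDS]

def field_weights(fields):
    """Return one {word: T} dict per field, with T = IF * IDF (assumed formulas)."""
    docs = [tokenize(f) for f in fields]
    n_docs = len(docs)
    doc_freq = Counter(w for d in docs for w in set(d))
    weights = []
    for d in docs:
        counts = Counter(d)
        weights.append({
            w: (c / len(d)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return weights

fields = ["Buffer overflow in the Apache HTTP Server",
          "Apache HTTP Server denial of service",
          "SQL injection in an example CMS"]
weights = field_weights(fields)

# Words found in fewer fields get a higher weight than words shared across fields.
print(weights[0]["buffer"] > weights[0]["apache"])  # True
```

Words with the highest weights are natural candidates for the retrieval feature values mentioned in the last step of the module.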

The vulnerability database fusion process is shown in Fig. 3. First, the three vulnerability data sets, namely the data set collected from public vulnerability databases, the data set obtained by reverse-engineering patches, and the data set obtained by the vulnerability mining system, are divided into two groups: patch + public, and vulnerability mining + public. Text matching is performed on the vulnerability reference links of each group of databases to obtain the identical and the related vulnerability entries in the two vulnerability databases of the group. The results are then merged pairwise to obtain the final unified database.
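The merge decision applied during fusion can be sketched in Python as follows: "identical" entries are merged directly, and "related" entries are merged only when vendor and product type match. The record layout, the `normalize` helper (trimming and ignoring letter case), and the sample entries are assumptions made for illustration.

```python
def normalize(text: str) -> str:
    """Illustrative normalization: trim whitespace and ignore letter case."""
    return text.strip().casefold()

def should_merge(relation: str, entry_a: dict, entry_b: dict) -> bool:
    """Decide whether two vulnerability entries become one record in the unified database."""
    if relation == "identical":
        return True
    if relation == "related":
        return (normalize(entry_a["vendor"]) == normalize(entry_b["vendor"])
                and normalize(entry_a["product"]) == normalize(entry_b["product"]))
    return False   # "unrelated" entries are kept separate

a = {"vendor": "Example Corp", "product": "Grid Gateway"}
b = {"vendor": "example corp ", "product": "grid gateway"}
c = {"vendor": "Example Corp", "product": "Meter Hub"}

print(should_merge("related", a, b))    # True
print(should_merge("related", a, c))    # False
print(should_merge("identical", a, c))  # True
```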

Embodiment one

As shown in Fig. 1, an artificial-intelligence-based electric power malicious code detection method specifically includes:

Step 1: Traverse the entire electric power system call sequence to obtain the number of times each API is called and the call relations between APIs as the call-relation frequency feature vector of the corresponding training sample.

Existing malicious program behavior detection methods mainly build feature vectors from the frequencies of individual system calls; the frequency with which a program executes each system call at run time can characterize its behavior. If programs of one type use certain system calls more often than programs of another type, this feature can be used for classification. However, this approach ignores the connections between different system calls, which amounts to discarding many behavioral features.

Because the call relations between APIs reflect a program's behavioral characteristics better than the call counts of individual APIs, the API data dependency graph is used to capture API call relations. However, system call relation frequencies are counted instead of applying graph mining algorithms, which avoids the computational complexity of graph mining.

Specifically, before the electric power system call sequence is traversed, the method further includes constructing the API data dependency graph of the entire electric power system, as follows:

Number all APIs of the electric power system to obtain the system call sequence;

Definition (API data dependency graph): an API data dependency graph is a directed graph G(V, E) that represents the data-flow dependencies between system calls, where V is the set of system calls and E is a set of directed edges. A directed edge (x, y) points from system call x to system call y and indicates that the output of system call x is used by system call y, as shown in Fig. 2.

The API data dependency graph can be established through system call association analysis.

In the system call sequence, two system calls that occur close together in time are compared to see whether they have the same parameters; if so, they are regarded as associated system calls. By the locality of programs, a system call is related to several of the calls preceding it, but the degree of correlation decreases as the distance between them grows. Therefore, only system calls within a certain range are searched to find associated system call pairs.

For example, let the position of system call A be $L_1$ and the position of system call B be $L_2$, let the parameter set of system call A be $S_A$ and that of system call B be $S_B$, and compute $S_A \cap S_B$.

When the two calls lie within the search range and $S_A \cap S_B \neq \varnothing$, system call A is related to system call B, and the association pair A → B is used to describe this relation.
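The association analysis can be sketched in Python as a windowed search over the preceding calls. This is a minimal sketch under stated assumptions: the trace is a list of (API number, parameter set) pairs in execution order, two calls are associated when their parameter sets share at least one element, and the search range is a fixed number of preceding calls given by the hypothetical parameter `window`. In practice the set elements would be parameter hash values.

```python
def associated_pairs(calls, window=3):
    """Return (earlier_api, later_api) pairs whose parameter sets intersect."""
    pairs = []
    for pos, (api, params) in enumerate(calls):
        start = max(0, pos - window)
        # Search the preceding calls, nearest first, within the window.
        for prev_pos in range(pos - 1, start - 1, -1):
            prev_api, prev_params = calls[prev_pos]
            if params & prev_params:          # non-empty intersection
                pairs.append((prev_api, api))
    return pairs

trace = [(1, {"h1"}),            # call 1 produces handle h1
         (2, {"h1", "buf"}),     # call 2 uses h1 and a buffer
         (3, {"h2"}),            # call 3 shares nothing with earlier calls
         (4, {"buf"})]           # call 4 reuses the buffer from call 2

pairs = associated_pairs(trace)
print(pairs)  # [(1, 2), (2, 4)]
```

Each returned pair corresponds to one directed edge A → B of the API data dependency graph; a smaller `window` trades recall for speed.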

In a specific implementation, while the entire electric power system call sequence is traversed, related system calls are searched for among the preceding calls in the sequence for the system call currently being traversed;

The number of calls between any two system calls is counted, with C(x, y) denoting the number of times API x calls API y;

The number of times each API is called is counted, with C(y, y) denoting the number of times API y is called, which here includes both x calling y and y calling t;

Assuming there are n system calls, a vector M with n × n elements is built from the system call dependency graph:

$$M = \bigl(C(1,1), C(1,2), \ldots, C(1,n),\; C(2,1), C(2,2), \ldots, C(2,n),\; \ldots,\; C(n,1), C(n,2), \ldots, C(n,n)\bigr)$$

The feature vector M contains the number of times each API is called and also reflects the call relations between APIs, so it describes program behavior more comprehensively and in greater depth.
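A minimal Python sketch of how the vector M might be assembled is given below. It assumes that APIs are numbered 1 to n, that the off-diagonal entry C(x, y) counts the association pairs x → y found in the trace, and that the diagonal entry C(y, y) counts how often API y occurs in the trace; the last point is one possible reading of "number of times called". The function name and sample data are hypothetical.

```python
def call_relation_vector(api_sequence, pairs, n):
    """Flatten the n x n count table row by row into the feature vector M."""
    m = [0] * (n * n)
    for x, y in pairs:
        if x != y:
            m[(x - 1) * n + (y - 1)] += 1     # C(x, y): x -> y associations
    for y in api_sequence:
        m[(y - 1) * n + (y - 1)] += 1         # C(y, y): occurrences of y
    return m

api_sequence = [1, 2, 3, 2]                   # numbered system call trace
pairs = [(1, 2), (2, 3)]                      # association pairs found in the trace
M = call_relation_vector(api_sequence, pairs, n=3)

print(M)  # [1, 1, 0, 0, 2, 1, 0, 0, 1]
```

The length of M depends only on n, so every sample yields a vector of the same size regardless of how long its trace is.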

Here, system call APIs represent the interaction between the operating system and applications and are the primary data analyzed in malicious program behavior detection.

Step 2: Convert each system call in the electric power system call sequence into a 1-hot vector and feed it into a recurrent neural network with an LSTM model; convert the hidden state of the recurrent neural network into the fixed-length feature vector of the corresponding training sample.

Behavior tracking of a sample yields a run-time system call sequence of variable length, and such data cannot be used to train a traditional neural network. A recurrent neural network, however, can absorb the information contained in the sequence into its hidden state through sequence prediction, and the final hidden state is converted into a fixed-length feature vector. The fixed-length feature vectors of all samples are obtained in this way and fed into a convolutional neural network for feature extraction and then classification.

To perform its function, a sample must interact with the operating system through API calls, for example to open a file. The dynamic behavior of a sample is recorded by recording its API call sequence. A single API call represents a specific operation, several API calls represent an activity of the sample, and the full API call sequence represents the overall behavior of the sample. This hierarchy resembles an article: an article is made of sentences, and sentences are made of words. By analogy with language models, a recurrent neural network is therefore used to extract the features of a sample. After training is complete, the feature vector is extracted from the final hidden state of the recurrent neural network.

The extracted feature vectors contain the behavioral features of the sample. A convolutional neural network is applied to these feature vectors to train a classifier that distinguishes whether a sample is malware.

Based on the acquired system call sequence logs, a sample behavior "language" model is constructed. The model uses a recurrent neural network with LSTM units. The recurrent neural network consists of an input layer x, an ordinary hidden layer h¹, two LSTM layers h² and h³, and an output layer y.

During training, system calls are first converted into 1-hot vectors, as follows:

1) Create a dictionary that maps system calls one-to-one to IDs.

2) Convert each system call into a vector in which the position corresponding to its ID is 1 and all other positions are 0.
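The two conversion steps above can be sketched in Python as follows. The API names in the sample trace are used purely for illustration, and the helper names are hypothetical.

```python
def build_dictionary(calls):
    """Step 1: assign each distinct system call a unique ID in order of first appearance."""
    ids = {}
    for name in calls:
        ids.setdefault(name, len(ids))
    return ids

def one_hot(call, ids):
    """Step 2: a vector with 1 at the call's ID position and 0 everywhere else."""
    vec = [0] * len(ids)
    vec[ids[call]] = 1
    return vec

trace = ["NtOpenFile", "NtReadFile", "NtClose", "NtReadFile"]
ids = build_dictionary(trace)
encoded = [one_hot(call, ids) for call in trace]

print(ids)         # {'NtOpenFile': 0, 'NtReadFile': 1, 'NtClose': 2}
print(encoded[1])  # [0, 1, 0]
```

The vector length equals the dictionary size, so the dictionary must be built over all training traces before any call is encoded.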

The recurrent neural network is trained over the system call sequences repeatedly. First, the system call sequence is converted into 1-hot vectors, and each vector $x_t$ is fed into the recurrent neural network in order to obtain the output $y_t$. Then $y_t$ is compared with $x_{t+1}$ and the loss function is computed. After a specified number of steps, the parameters of the recurrent neural network are updated with the back-propagation algorithm. Finally, when the feature vector is extracted, the last hidden state of the recurrent neural network is used as the feature vector.
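The comparison of each output $y_t$ with the next input $x_{t+1}$ can be sketched in Python as a next-call prediction loss. The text does not name the loss function used at this stage, so cross-entropy over a predicted probability distribution is an assumption here, and the network itself is replaced by hand-written predictions for illustration.

```python
import math

def next_call_loss(predictions, one_hot_seq):
    """Mean cross-entropy between each prediction y_t and the next input x_{t+1}."""
    steps = len(one_hot_seq) - 1
    total = 0.0
    for t in range(steps):
        target_index = one_hot_seq[t + 1].index(1)    # ID of the call that actually follows
        total += -math.log(predictions[t][target_index])
    return total / steps

sequence = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]          # x_1, x_2, x_3 as 1-hot vectors
predictions = [[0.1, 0.8, 0.1],                       # y_1: predicted distribution for x_2
               [0.2, 0.2, 0.6]]                       # y_2: predicted distribution for x_3

loss = next_call_loss(predictions, sequence)
print(round(loss, 4))
```

A network that assigns higher probability to the call that actually follows receives a lower loss, which is what drives the hidden state to encode the behavior of the sequence.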

In the training stage, feature vectors labeled as malware or not are used for training. Each feature vector is first converted into image form, that is, into a square feature matrix of size $W_0 \times W_0$. The convolutional neural network consists of an input layer, two convolution-pooling layers, a fully connected layer, and an output layer.

The first convolutional layer filters the $W_0 \times W_0 \times 1$ input feature matrix with 10 convolution kernels.

The second convolutional layer filters the $W_1 \times W_1 \times 10$ feature matrix output by the previous stage with 20 kernels.

Each pooling layer uses a stride of 2, takes the output of the preceding convolutional layer, and reduces its size to 1/2. After a fully connected layer, a 2-dimensional output layer is obtained as the classifier result.
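The layer sizes described above can be traced with a short Python sketch. Two points are assumptions, since the text does not state them: the feature vector is zero-padded up to the next perfect square before being reshaped, and the convolutions preserve the spatial size (so that only pooling halves it, giving $W_1 = W_0 / 2$). The function names are hypothetical.

```python
import math

def to_square_matrix(vector):
    """Reshape a feature vector into a W0 x W0 matrix, zero-padding the tail (assumed)."""
    if not vector:
        raise ValueError("empty feature vector")
    side = math.isqrt(len(vector) - 1) + 1            # smallest side with side*side >= len
    padded = list(vector) + [0.0] * (side * side - len(vector))
    return [padded[r * side:(r + 1) * side] for r in range(side)]

def trace_shapes(w0):
    """Follow the (width, height, channels) shape through the described layers."""
    shapes = [("input", (w0, w0, 1))]
    w = w0
    for name, kernels in (("conv1+pool1", 10), ("conv2+pool2", 20)):
        w = w // 2                                     # stride-2 pooling halves each side
        shapes.append((name, (w, w, kernels)))
    shapes.append(("fully connected -> output", (2,)))
    return shapes

matrix = to_square_matrix(list(range(14)))             # 14 features -> 4 x 4 matrix
shapes = trace_shapes(len(matrix))
for name, shape in shapes:
    print(name, shape)
```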

Finally, the probability that a sample is malware is calculated by the sigmoid function.

Because an LSTM model suited to the target has to be designed, the designed model must be guaranteed to converge, and this needs to be demonstrated mathematically.

To simplify the notation for matrix operations, two common products are listed:

Frobenius product: the inner product of two matrices, in which all corresponding entries are multiplied and then summed, $\langle A, B \rangle_F = \sum_{i,j} A_{ij} B_{ij}$.

Hadamard product: two matrices are multiplied entry by entry, and the result is still a matrix, $(A \odot B)_{ij} = A_{ij} B_{ij}$.
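As a small worked example of the two products, the following Python sketch computes both for a pair of 2×2 matrices. The matrices `A` and `B` and the function names are illustrative only.

```python
def hadamard(a, b):
    """Entry-by-entry product of two equally sized matrices; the result is a matrix."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def frobenius(a, b):
    """Inner product of two matrices: multiply corresponding entries, then sum."""
    return sum(sum(row) for row in hadamard(a, b))


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
elementwise = hadamard(A, B)   # [[5, 12], [21, 32]]
inner = frobenius(A, B)        # 5 + 12 + 21 + 32 = 70
```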

The LSTM formulas are as follows, where σ is the sigmoid function. Applying σ to a matrix means applying it to each element, and the same holds for the tanh function. $x_t$ is the input at time step t, $h_{t-1}$ is the output of the previous time step, and W, U and b are the corresponding weights and biases, which are the parameters to be trained.

The LSTM is a model that determines how information flows through the memory cell, and it has three main gates: the forget gate $f_t$, the input gate $i_t$ and the output gate $o_t$. Each of the three gates applies a linear transformation to the inputs $x_t$ and $h_{t-1}$ to obtain a weight matrix (or vector), namely $f_t$, $i_t$ and $o_t$. Because the sigmoid function is used, every element lies between 0 and 1, so in the corresponding $\odot$ operation each element acts as a weight on the element it multiplies. $\tilde{c}_t$ is the newly obtained memory cell state value.

To obtain the final memory cell state $c_t$, the term $i_t \odot \tilde{c}_t$ uses the input gate to keep the useful part of the new state, and the term $f_t \odot c_{t-1}$ uses the forget gate to filter unneeded components out of the old memory. The newest memory cell state $c_t$ is the sum of the two. Finally the output must be determined: tanh limits the memory cell state value to the interval [-1, +1], and the output gate weights control the actual output $h_t$.

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$

$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$

$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
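The gate equations and the state updates described above can be combined into a single time step. The sketch below is a plain-Python illustration under stated assumptions: the candidate state, the cell-state update and the output follow the standard LSTM formulation (the text describes them only in prose), the parameter layout `params[name] = (W, U, b)` is hypothetical, and the all-zero parameters exist only to make the example easy to check by hand.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def gate(name, x_t, h_prev, params, activation):
    """Apply the linear transformation for one gate, then its activation."""
    W, U, b = params[name]
    return [activation(z) for z in affine(W, x_t, U, h_prev, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step returning the new output h_t and cell state c_t."""
    i_t = gate("i", x_t, h_prev, params, sigmoid)        # input gate
    f_t = gate("f", x_t, h_prev, params, sigmoid)        # forget gate
    o_t = gate("o", x_t, h_prev, params, sigmoid)        # output gate
    c_tilde = gate("c", x_t, h_prev, params, math.tanh)  # candidate state (assumed form)
    c_t = [i * g + f * c for i, g, f, c in zip(i_t, c_tilde, f_t, c_prev)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def zero_params(n_in, n_hidden):
    """Hypothetical all-zero parameters, used only to make the example checkable."""
    def block():
        W = [[0.0] * n_in for _ in range(n_hidden)]
        U = [[0.0] * n_hidden for _ in range(n_hidden)]
        b = [0.0] * n_hidden
        return W, U, b
    return {name: block() for name in ("i", "f", "o", "c")}


params = zero_params(n_in=3, n_hidden=2)
h_t, c_t = lstm_step([0, 1, 0], [0.0, 0.0], [1.0, -2.0], params)
```

With zero parameters every gate equals 0.5 and the candidate state is 0, so the new cell state is simply half of the old one. Trained parameters would of course give different values.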

The mean of all LSTM outputs $h_t$ is used as the classifier input. A simple sigmoid linear classification model is used with cross entropy as the loss function, so the objective is to minimise the cross entropy.
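A minimal sketch of this classifier head is shown below, assuming the outputs $h_t$ are available as a list of equal-length vectors. The function names, the example hidden states and the zero weights are hypothetical, and the clipping constant `eps` is an implementation detail added to avoid taking the logarithm of 0.

```python
import math


def mean_pool(hidden_states):
    """Average the LSTM outputs h_t over all time steps, element by element."""
    steps = len(hidden_states)
    return [sum(column) / steps for column in zip(*hidden_states)]


def predict_proba(features, weights, bias):
    """Sigmoid linear model: probability that the sample belongs to class 1."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(p, label, eps=1e-12):
    """Binary cross-entropy loss for one sample with label 0 or 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Hypothetical outputs of a 2-unit LSTM over two time steps.
hidden_states = [[0.2, -0.4], [0.6, 0.0]]
features = mean_pool(hidden_states)                            # about [0.4, -0.2]
probability = predict_proba(features, weights=[0.0, 0.0], bias=0.0)
loss = cross_entropy(probability, label=1)                     # ln 2 for an untrained model
```

Training would adjust `weights` and `bias` (and the LSTM parameters) to reduce this loss over all labelled samples.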

The model can be viewed as an LSTM followed by a linear classifier. The LSTM can be trained first and the classifier trained once the LSTM parameters are fixed, or the LSTM and the classifier can be trained together as a whole. Following the strategy adopted in deep learning, the model is trained as a whole, and the gradient update formula for each parameter must be known before training.

Step 3: Combine the call relation frequency feature vector and the fixed-length feature vector of each training sample, input the combined vector into the trained decision tree classifier for classification, and detect electric power malicious code.

The trained decision tree classifier is a multi-tree classifier (random forest) or an iterative decision tree (GBDT).

When classifying with decision trees, data imbalance can arise. The class with fewer samples is called the minority class, and the class with more samples is called the majority class.

In a two-class data set, the majority class is called the positive class and the minority class is called the negative class. For a two-class problem, giving the negative class a higher weight makes the classifier impose a larger penalty during training for misclassifying minority-class samples, which forces the resulting classifier to predict minority-class samples more accurately.
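The effect of class weighting can be shown with a short Python sketch. The text does not say how the weights are chosen, so the inverse-frequency rule below is an assumption, as are the function names and the toy labels (1 for the majority positive class, 0 for the minority negative class).

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (an assumed heuristic)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}


def weighted_error(labels, predictions, weights):
    """Share of the total class weight carried by misclassified samples."""
    wrong = sum(weights[y] for y, p in zip(labels, predictions) if y != p)
    total = sum(weights[y] for y in labels)
    return wrong / total


# Toy data: 8 majority (positive, 1) samples and 2 minority (negative, 0) samples.
labels = [1] * 8 + [0] * 2
always_positive = [1] * 10                     # a classifier that ignores the minority class
weights = balanced_class_weights(labels)       # {1: 0.625, 0: 2.5}
plain = weighted_error(labels, always_positive, {0: 1.0, 1: 1.0})
weighted = weighted_error(labels, always_positive, weights)
```

Without weights the trivial classifier has an error of 0.2. With the assumed weights its error rises to 0.5, so training is pushed away from ignoring the minority class.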

In a two-class imbalanced classification problem, feature selection is used to improve the performance of the classification system.

The ROC (Receiver Operating Characteristic) curve is a tool for representing classifier performance graphically. ROC analysis was first used in signal detection to show the trade-off between two quantities, the detection rate and the false-alarm rate. In recent years, the ROC curve and related techniques have been introduced into the deep learning field.

In a two-class problem, the continuous output of most classifiers, h(x) → R, can be regarded as the classifier's score for how likely sample x is to belong to the positive class: the larger h(x) is, the more likely x belongs to the positive class, and the smaller h(x) is, the more likely x belongs to the negative class. When building a traditional classifier, 0.5 is usually taken as the default threshold, so x is assigned to the positive class when h(x) > 0.5 and to the negative class otherwise. In practice this does not work well. A more appropriate approach is to search the range of h(x) for an optimal threshold θ that separates the positive and negative classes, so that the final decision H(x) assigns x to the positive class when h(x) > θ and to the negative class otherwise.

As θ is adjusted continuously, a series of different H(x) is obtained, each with a corresponding point pair (positive-class recognition rate, negative-class misclassification rate). The line connecting these points is the ROC curve. A plotted ROC curve lets people inspect classifier performance visually, but it is not convenient for comparing classifiers. For this reason, the area under the curve, called AUC, is often used as a measure of classifier quality.
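The threshold sweep can be sketched as follows. This is an illustrative Python example rather than the patent's procedure: the scores and labels are made up, the function names are hypothetical, thresholds are placed at the observed scores, and the area is computed with the trapezoid rule. Both classes are assumed to be present.

```python
def roc_points(scores, labels):
    """Return (negative-class misclassification rate, positive-class recognition rate) pairs."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Using s >= theta at each observed score equals h(x) > theta for theta just below it.
    for theta in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= theta and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= theta and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical classifier scores h(x) and true labels (1 = positive class).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = auc_trapezoid(points)   # 8/9 for this toy data
```

The value 8/9 is the fraction of (positive, negative) pairs that the scores rank in the correct order.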

The N samples are sorted by the classifier's predicted score h(x) for belonging to the positive class and then numbered from smallest to largest, so that the sample with the smallest h value is numbered 1 and the sample with the largest h value is numbered N. When several samples have equal h values, each of them is assigned the average of their numbers.
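The numbering scheme above is the one used by the rank-sum way of computing AUC. The formula itself is not reproduced in the text, so the sketch below assumes the standard rank-sum identity AUC = (R − M(M+1)/2) / (M·K), where R is the sum of the numbers of the positive samples, M is the count of positive samples and K is the count of negative samples. The function names and data are illustrative.

```python
def average_ranks(scores):
    """Number scores from 1 (smallest) to N (largest); equal scores share their mean number."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        mean_rank = (start + 1 + end + 1) / 2.0   # average of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def auc_from_ranks(scores, labels):
    """AUC from the rank sum of the positive samples (assumed standard identity)."""
    ranks = average_ranks(scores)
    m = sum(labels)                 # number of positive samples
    k = len(labels) - m             # number of negative samples
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - m * (m + 1) / 2.0) / (m * k)


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
auc = auc_from_ranks(scores, labels)           # (14 - 6) / 9 = 8/9
tied = auc_from_ranks([0.5, 0.5], [1, 0])      # a tie counts as half: 0.5
```

Compared with sweeping thresholds, this form needs only one sort, which matters when AUC is evaluated once per feature.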

AUC is used to measure the correlation between each feature and the class label.

First, the rank correlation coefficient (Spearman's rank correlation coefficient, RCC) is used to measure the redundancy between two features (for example f₁ and f₂). Removing redundant features can significantly improve the performance of the algorithm. In the calculation, all samples are first sorted from smallest to largest by their value of f₁, and the number of the i-th sample in this ordering is denoted p₁ (the sample with the smallest value of f₁ is numbered 1).

Next, all samples are sorted from smallest to largest by f₂, and the number of the i-th sample in this ordering is denoted p₂. Then

d_i = p₁ - p₂

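The rank correlation coefficient between f₁ and f₂ can then be calculated with the following equation, where N is the number of samples:

$RCC(f_1, f_2) = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}$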
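The rank-difference calculation can be sketched in Python as follows, assuming the standard Spearman form 1 − 6·Σd_i² / (N(N² − 1)) with d_i the difference between a sample's two numbers. The function names and feature values are hypothetical, and ties are broken by sample position for simplicity, which is an assumption.

```python
def ranks_ascending(values):
    """Number samples 1..N by ascending feature value (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def rank_correlation(f1, f2):
    """Spearman rank correlation between two feature columns (needs N >= 2)."""
    n = len(f1)
    p1 = ranks_ascending(f1)
    p2 = ranks_ascending(f2)
    d_squared = sum((a - b) ** 2 for a, b in zip(p1, p2))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Two hypothetical feature columns over four samples.
same_order = rank_correlation([1, 2, 3, 4], [10, 20, 30, 40])      # 1.0
reverse_order = rank_correlation([1, 2, 3, 4], [40, 30, 20, 10])   # -1.0
```

A coefficient close to 1 or −1 indicates that one feature carries almost the same ordering information as the other, which is the sense of redundancy used here.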

The specific algorithm is as follows:

Input: data set D and the feature subset size k specified by the user

Output: the selected feature subset S

The artificial-intelligence-based electric power malicious code detection method of the invention obtains the number of times each API is called and the call relations between APIs as the call relation frequency feature vector of the corresponding training sample. It then converts each system call in the power system call sequence into a 1-hot vector, inputs it into a recurrent neural network with an LSTM model, and converts the hidden state of the recurrent neural network into the fixed-length feature vector of the corresponding training sample. Finally, it combines the call relation frequency feature vector and the fixed-length feature vector of the training sample, thereby integrating manually defined features with hidden features, and inputs the result into the trained decision tree classifier for classification, which improves the speed and accuracy of electric power malicious code detection.

Embodiment two

As shown in Fig. 4, an artificial-intelligence-based electric power malicious code detection server of the invention comprises:

(1) A call relation frequency feature vector acquisition module, configured to traverse the entire power system call sequence and obtain the number of times each API is called and the call relations between APIs as the call relation frequency feature vector of the corresponding training sample.

Specifically, in the call relation frequency feature vector acquisition module, while the entire power system call sequence is traversed, the sequence is searched toward the front from the system call currently being traversed for system calls related to it;

(2) A fixed-length feature vector acquisition module, configured to convert each system call in the power system call sequence into a 1-hot vector, input it into a recurrent neural network with an LSTM model, and convert the hidden state of the recurrent neural network into the fixed-length feature vector of the corresponding training sample.

Specifically, in the fixed-length feature vector acquisition module, each system call is converted into a 1-hot vector as follows:

Create a dictionary that maps each system call to a unique ID, one to one;

(3) A code classification module, configured to combine the call relation frequency feature vector and the fixed-length feature vector of the training sample, input them into the trained decision tree classifier for classification, and detect electric power malicious code.

Specifically, in the code classification module, AUC is used to measure the correlation between each feature and the class during classification with the trained decision tree classifier.
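The traversal performed by module (1) can be sketched as follows. The relatedness test is not spelled out here, so this Python sketch makes several assumptions: each call is represented as a hypothetical `(api_name, params)` pair, "searching toward the front" is read as scanning the calls that occurred earlier, and two calls are treated as related when they share at least one parameter value. All API names and parameter values are illustrative.

```python
from collections import Counter


def call_relation_features(trace):
    """Count API calls and related (earlier_api, current_api) pairs in a call trace.

    trace: list of (api_name, params) tuples in call order.
    """
    api_counts = Counter()
    relation_counts = Counter()
    for current, (api, params) in enumerate(trace):
        api_counts[api] += 1
        # Scan the calls that occurred before the current one.
        for earlier in range(current - 1, -1, -1):
            prev_api, prev_params = trace[earlier]
            if set(params) & set(prev_params):   # assumed test: a shared parameter value
                relation_counts[(prev_api, api)] += 1
    return api_counts, relation_counts


# Hypothetical trace: file handles "h1"/"h2" and a buffer "buf" link the calls.
trace = [
    ("OpenFile", ("h1",)),
    ("ReadFile", ("h1", "buf")),
    ("OpenFile", ("h2",)),
    ("WriteFile", ("h2", "buf")),
    ("CloseFile", ("h1",)),
]
api_counts, relation_counts = call_relation_features(trace)
```

Flattening both counters in a fixed key order would give one possible layout for the call relation frequency feature vector.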

In another embodiment, the artificial-intelligence-based electric power malicious code detection server further comprises:

The artificial-intelligence-based electric power malicious code detection server of the invention obtains the number of times each API is called and the call relations between APIs as the call relation frequency feature vector of the corresponding training sample. It then converts each system call in the power system call sequence into a 1-hot vector, inputs it into a recurrent neural network with an LSTM model, and converts the hidden state of the recurrent neural network into the fixed-length feature vector of the corresponding training sample. Finally, it combines the call relation frequency feature vector and the fixed-length feature vector of the training sample, thereby integrating manually defined features with hidden features, and inputs the result into the trained decision tree classifier for classification, which improves the speed and accuracy of electric power malicious code detection.

The invention also provides an artificial-intelligence-based electric power malicious code detection system, which integrates manually defined features with hidden features and improves the speed and accuracy of electric power malicious code detection.

An artificial-intelligence-based electric power malicious code detection system of the invention comprises the artificial-intelligence-based electric power malicious code detection server shown in Fig. 4.

The artificial-intelligence-based electric power malicious code detection system of the invention obtains the number of times each API is called and the call relations between APIs as the call relation frequency feature vector of the corresponding training sample. It then converts each system call in the power system call sequence into a 1-hot vector, inputs it into a recurrent neural network with an LSTM model, and converts the hidden state of the recurrent neural network into the fixed-length feature vector of the corresponding training sample. Finally, it combines the call relation frequency feature vector and the fixed-length feature vector of the training sample, thereby integrating manually defined features with hidden features, and inputs the result into the trained decision tree classifier for classification, which improves the speed and accuracy of electric power malicious code detection.

Those skilled in the art should understand that embodiments of the invention may be provided as a method, a system, or a computer program product. Accordingly, the invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware. Moreover, the invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage and optical storage) that contain computer-usable program code.

The invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.

The above are only preferred embodiments of the application and are not intended to limit it. For those skilled in the art, various modifications and changes to the application are possible. Any modification, equivalent replacement or improvement made within the spirit and principles of the application shall fall within its scope of protection.

Claims

1. An artificial-intelligence-based electric power malicious code detection method, characterized by comprising:

traversing the entire power system call sequence, and obtaining the number of times each API is called and the call relations between APIs as the call relation frequency feature vector of the corresponding training sample;

converting each system call in the power system call sequence into a 1-hot vector, inputting it into a recurrent neural network with an LSTM model, and converting the hidden state of the recurrent neural network into the fixed-length feature vector of the corresponding training sample;

combining the call relation frequency feature vector and the fixed-length feature vector of the training sample, inputting them into a trained decision tree classifier for classification, and detecting electric power malicious code.

2. The artificial-intelligence-based electric power malicious code detection method according to claim 1, characterized in that, before the entire power system call sequence is traversed, the method further comprises constructing an API data dependency graph of the entire power system, specifically:

numbering all APIs of the power system to obtain the system call sequence;

establishing a directed API data dependency graph of the entire power system according to the data-flow dependencies in the system call sequence.

3. The artificial-intelligence-based electric power malicious code detection method according to claim 1, characterized in that, while the entire power system call sequence is traversed, the sequence is searched toward the front from the system call currently being traversed for system calls related to it;

each time a system call is found, its parameters are compared with those of the system call currently being traversed to judge the correlation between the two system calls;

4. The artificial-intelligence-based electric power malicious code detection method according to claim 1, characterized in that each system call is converted into a 1-hot vector as follows:

creating a dictionary that maps each system call to a unique ID, one to one;

5. The artificial-intelligence-based electric power malicious code detection method according to claim 1, characterized in that, during classification with the trained decision tree classifier, AUC is used to measure the correlation between each feature and the class.

6. An artificial-intelligence-based electric power malicious code detection server, characterized by comprising:

a call relation frequency feature vector acquisition module, configured to traverse the entire power system call sequence and obtain the number of times each API is called and the call relations between APIs as the call relation frequency feature vector of the corresponding training sample;

a fixed-length feature vector acquisition module, configured to convert each system call in the power system call sequence into a 1-hot vector, input it into a recurrent neural network with an LSTM model, and convert the hidden state of the recurrent neural network into the fixed-length feature vector of the corresponding training sample;

a code classification module, configured to combine the call relation frequency feature vector and the fixed-length feature vector of the training sample, input them into a trained decision tree classifier for classification, and detect electric power malicious code.

7. The artificial-intelligence-based electric power malicious code detection server according to claim 6, characterized in that the server further comprises:

an API data dependency graph construction module, configured to number all APIs of the power system before the entire power system call sequence is traversed, and obtain the system call sequence;

8. The artificial-intelligence-based electric power malicious code detection server according to claim 6, characterized in that, in the call relation frequency feature vector acquisition module, while the entire power system call sequence is traversed, the sequence is searched toward the front from the system call currently being traversed for system calls related to it;

9. The artificial-intelligence-based electric power malicious code detection server according to claim 6, characterized in that, in the fixed-length feature vector acquisition module, each system call is converted into a 1-hot vector as follows:

creating a dictionary that maps each system call to a unique ID, one to one;

and/or

in the code classification module, during classification with the trained decision tree classifier, AUC is used to measure the correlation between each feature and the class.

10. An artificial-intelligence-based electric power malicious code detection system, characterized by comprising the artificial-intelligence-based electric power malicious code detection server according to any one of claims 6 to 9.