CN105224873B - A smart device file identification method

Info

Publication number: CN105224873B
Application number: CN201510790427.0A
Authority: CN
Inventors: 陈虹宇; 罗阳; 苗宁
Original assignee: SICHUAN SHENHU TECHNOLOGY Co Ltd
Current assignee: SICHUAN SHENHU TECHNOLOGY Co Ltd
Priority date: 2015-11-17
Filing date: 2015-11-17
Publication date: 2018-06-08
Anticipated expiration: 2035-11-17
Also published as: CN105224873A

Abstract

The present invention provides a smart device file identification method. The method comprises: extracting feature vectors of the script code in a web page file, classifying the feature vectors, and identifying from the classification results whether the web page file contains malicious code. The invention proposes a file detection and identification method that detects different intrusion methods with different classification methods and introduces deobfuscation processing to prevent malicious code from disguising itself, thereby improving the detection success rate.

Description

A smart device file identification method

Technical field

The present invention relates to computer data security, and more particularly to a smart device file identification method.

Background

With the continuous development and spread of the Internet, network security incidents keep emerging, the entire mobile Internet environment is under serious threat, and society suffers huge losses. Most network security incidents are caused by hacker attacks, and the underlying cause is security vulnerabilities in the software or the documents themselves. Attackers exploit these vulnerabilities to tamper with or disguise web page files on mobile devices so that ordinary users cannot recognize them, and use the opportunity to execute or distribute illegal programs. Existing web page file detection includes static detection and dynamic detection, but both only monitor the functions and events triggered when the file runs and do not consider the obfuscation techniques used by attackers, so the recognition rate for malicious script code is very low. In addition, existing detection models rely on emulation techniques, which consume excessive computing resources on the mobile device.

Summary of the invention

To solve the above problems of the prior art, the present invention proposes a smart device file identification method, comprising:

extracting feature vectors of the script code in a web page file and classifying the feature vectors, and identifying from the classification results whether the web page file contains malicious code.

Preferably, extracting the features of the code further comprises:

first extracting the script code from the web page file, then performing feature extraction with the word as the unit, performing feature selection on the extracted feature vectors, and increasing the weight of key feature vectors; in the web page file, the entry position of the script code is located by keyword; wherein the extraction of the script code specifically comprises:

1. Open the web page file;

2. Initialize the internal data structures;

3. Search the catalog directory and find the address of the active dictionary entry;

4. Search the candidate positions that may contain script code, and check the data type of each dictionary entry;

5. If the data type is an element of the predefined keyword set, the dictionary contains script code, and the script code is extracted;

6. Decompress the script code;

The encoded script code stream is decoded: determine whether the characters in the stream are encoded, that is, whether the header of the code stream contains an encoding-mode field; if so, call the decoding function to decode it; finally, save the result;

The method further comprises: before the feature vector extraction, preprocessing the web page file. In the first step, the executable script code in the web page file is located and extracted; in the second step, the extracted script code is decoded and deobfuscated, finally yielding the original script code.

Preferably, classifying the feature vectors further comprises:

dividing the web page file into two parts, one being the embedded script code and the other being the remaining web page file data apart from the script code, and then detecting the two parts separately. An identification model built with the Bayes algorithm detects the script code. The specific identification process comprises: computing the probability P_N that the unknown web page file X belongs to the safe sample set C_n and the probability P_M that X belongs to the malicious sample set, then comparing P_N and P_M to obtain the class that X is closest to, thereby determining the class of X. If P_M > P_N, the web page file contains malicious script code; otherwise it does not;

an identification model built with the decision tree algorithm detects the remaining web page file data apart from the script code; finally, the detection results are merged to obtain the final identification result: if either of the two parts is identified as a malicious file, the unknown web page file is identified as a malicious file; if both parts are identified as safe files, the unknown web page file is a safe file.

Compared with the prior art, the present invention has the following advantages:

The invention proposes a file detection and identification method that detects different intrusion methods with different classification methods and introduces deobfuscation processing to prevent malicious code from disguising itself, improving the detection success rate.

Brief description of the drawings

Fig. 1 is a flow chart of the smart device file identification method according to an embodiment of the present invention.

Detailed description

A detailed description of one or more embodiments of the invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention covers many alternatives, modifications, and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of these details.

One aspect of the present invention provides a smart device file identification method. Fig. 1 is a flow chart of the smart device file identification method according to an embodiment of the present invention.

The invention specifically targets two different intrusion methods. It builds identification modules with two different feature extraction and classification methods and then connects the identification modules in parallel. The script code in the web page file is fully deobfuscated beforehand, which ensures the validity of the feature vector set for script-code-based attacks. Based on a multi-stage classification process, different intrusion methods are detected with different classification processes, improving the detection success rate.

The web page file detection method of the invention has three main modules: data preprocessing, feature extraction, and web page file identification.

(1) Data preprocessing: preprocessing is applied to the file set for the script-code-based intrusion method. Based on an analysis of script-code-based intrusion and of the web page file structure, the executable script code in the web page file is first located, that is, it is determined which objects contain script code, and the script code is extracted from those objects according to the reference relationships between objects and stored in a new text file. Then, according to the encoding of the script code, the encoded script code is decoded to restore the original script code. Finally, the script code is deobfuscated to remove redundant information, yielding the original script code.

(2) Feature extraction: the invention proposes two different feature extraction methods. For web page files attacked through script code, features are extracted with the word as the basic unit, which reduces extraction time. For web page files attacked through means other than script code, the web page file is divided into blocks and the same method as existing feature extraction is then used; after feature extraction ends, a feature selection algorithm reduces the feature dimension and selects the features with higher discriminative power.

(3) Web page file identification: according to the two feature extraction methods, two different classification models are built, one based on Bayes classification and one based on decision tree classification. The two classification processes are then combined in parallel, which improves the detection rate of the model.

Before feature vector extraction, the position of the script code in the web page file must first be determined and the script code extracted from the web page file. If the script code has been encoded, compressed, or obfuscated, the original script code must be restored. Finally, the feature vector set is extracted according to the feature extraction algorithm.

When an unknown web page file is detected, the executable script code is first extracted from the unknown web page file, then decoded and deobfuscated to obtain the original script code. Feature vector matching is then performed with a string matching algorithm to determine which feature vectors occur in the script code. Finally, the class of the unknown web page file is determined with the Bayes algorithm and the data obtained from the training samples.

Conventional detection can be used for web page files attacked through non-script-code means. First, the feature vectors of the training sample set are extracted. The training sample set is divided into two classes: a malicious file sample set based on non-script-code intrusion, and a safe file sample set without script code. During feature extraction, the feature vector sets of the two sample sets are extracted separately and processed with a feature selection algorithm to obtain the feature vector set required by the learning algorithm. An identification model is then built from the learning algorithm and the extracted feature vector set; the invention uses decision tree classification to build the identification model. Finally, unknown web page files are identified.

To identify an unknown web page file, its feature vector set is first extracted. This feature vector set effectively represents the unknown web page file and can stand in for the file during identification. The feature vector set is then fed to the recognizer, which classifies it according to the identification model it has built. This yields the classification result for the unknown web page file.

The feature extraction module proposed by the invention extracts feature vectors with two different methods, according to the existing web page file intrusion methods. For script-code-based intrusion, the script code is first extracted from the web page file and deobfuscated to obtain the original script code. Feature extraction is then performed with the word as the unit. Finally, feature selection is applied to the extracted feature vectors and the weight of key feature vectors is increased, so that the resulting feature vector set is more effective. For non-script-code-based intrusion, the web page file is divided into blocks, feature vectors are extracted from each block, and feature selection is then applied to obtain the final feature vectors.

Before feature vectors are extracted for script-code-based intrusion, the web page file is preprocessed in two steps. The first step locates and extracts the executable script code in the web page file; the second step decodes and deobfuscates the extracted script code, finally yielding the original script code.

In a web page file, script code usually resides in a dictionary. A dictionary contains several entries, each consisting of a key and a value. The key must be a name, and keys are unique within a dictionary; the value can be any legal object. Script code can be embedded in two ways: directly, as a hexadecimal or text string, or indirectly, stored in another object and referenced by pointer. In the latter case it is usually stored in an encrypted stream.

To extract script code reliably, the file must be processed at the semantic level. In a typical web page file, the entry position of the script code can be located by keyword. Besides being embedded directly in the web page file, script code may reside in other web page files on the local host, or even on a remote host. Script code supports dynamic invocation.

The extraction of script code is described below:

1. Open the web page file;

2. Initialize the internal data structures;

3. Search the catalog directory and find the address of the active dictionary entry;

4. Search the candidate positions described above that may contain script code, and check the data type of each dictionary entry;

5. If the data type is an element of the predefined keyword set, the dictionary contains script code, and the script code is extracted;

6. Decompress the script code.

An indirectly referenced object usually holds an encoded stream, so the encoded script code in the object is decoded: determine whether the characters in the stream are encoded, that is, whether the header of the stream contains an encoding-mode field; if so, call the decoding function to decode it. Finally, save the result.
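The decoding check described above can be sketched as follows. This is only an illustration under stated assumptions: the stream is modeled as a hypothetical dict with an optional `encoding` header field and a `data` payload, and the two encodings shown (hex and zlib deflate) are examples, since the text does not name the stream layout or the supported encodings.

```python
import binascii
import zlib

# Hypothetical decoder table: encoding name -> decoding function.
DECODERS = {
    "hex": binascii.unhexlify,
    "deflate": zlib.decompress,
}

def decode_stream(stream):
    """Return the raw script bytes, decoding only if the header names an encoding."""
    data = stream["data"]
    encoding = stream.get("encoding")  # encoding-mode field in the stream header
    if encoding is None:
        return data  # the stream is not encoded; keep it unchanged
    if encoding not in DECODERS:
        raise ValueError(f"unsupported encoding: {encoding}")
    return DECODERS[encoding](data)
```

A table of decoders keeps the header check separate from the decoding functions, so further encodings could be registered without changing the control flow.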

A malicious file can evade detection by adding redundant content. When the web page file reader crashes on opening a web page file, the user assumes the file is damaged, while the malicious script code is in fact running in the background. Some malicious files even embed malicious script code before the file header or after the end marker. Deobfuscation restores the script code to its original form and lays the foundation for the subsequent feature extraction. In the present invention, deobfuscation mainly handles two obfuscation techniques in script code: string splitting and added redundant content. First, comments unrelated to the operation of the script code are removed. Second, split strings are restored to the original strings. Script code may also contain a large number of variables whose names exceed 50 bytes; to simplify later processing, any variable whose name is longer than 50 bytes is renamed uniformly.
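The three deobfuscation operations named above (comment removal, rejoining split strings, uniform renaming of over-long variables) can be sketched with regular expressions. This is a simplified illustration, assuming JavaScript-like syntax with double-quoted strings; the function name `deobfuscate` and the replacement names `var_0`, `var_1`, ... are hypothetical, and the patterns do not handle comment markers or long names that appear inside string literals.

```python
import re

MAX_NAME_LEN = 50  # threshold from the text: longer names are renamed uniformly

def deobfuscate(script):
    """Strip comments, rejoin split string literals, rename over-long identifiers."""
    # 1. Remove block comments and line comments (naive pattern matching).
    script = re.sub(r"/\*.*?\*/", "", script, flags=re.DOTALL)
    script = re.sub(r"//[^\n]*", "", script)
    # 2. Rejoin literals split by concatenation: "ev" + "al" becomes "eval".
    joined = re.compile(r'"([^"\\]*)"\s*\+\s*"([^"\\]*)"')
    while True:
        script, count = joined.subn(r'"\1\2"', script)
        if count == 0:
            break
    # 3. Replace identifiers longer than MAX_NAME_LEN with short uniform names.
    names = {}
    def rename(match):
        return names.setdefault(match.group(0), f"var_{len(names)}")
    return re.sub(r"[A-Za-z_$][\w$]{%d,}" % MAX_NAME_LEN, rename, script)
```

The loop in step 2 repeats until no pair of adjacent literals remains, so a string split into three or more pieces is also restored.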

After the preceding data preprocessing, the script code is now the original script code. The feature vectors are extracted as follows.

1. Split the script code into a string sequence s, with the word as the unit;

2. Create a word-frequency lookup table m;

3. Traverse s and check whether each word w is in m; if so, go to step 4, otherwise go to step 5;

4. Increase the word frequency m[w] of word w in the lookup table by 1;

5. Set the word frequency of word w in the lookup table to m[w] = 1;

6. Traverse m with traversal pointer ptr;

7. If the entry that ptr points to is a keyword, increase its feature weight to the maximum value;

8. Select the top five feature vectors as the final feature vector set.
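The eight steps above amount to a word count followed by a keyword boost and a top-five selection. A minimal sketch follows, assuming that "maximum value" means the largest frequency in the table and that ties are broken alphabetically; the keyword set `KEYWORDS` is hypothetical, since the text does not list the actual keywords.

```python
import re
from collections import Counter

# Hypothetical keyword set; the text does not enumerate the keywords.
KEYWORDS = {"eval", "unescape", "fromCharCode"}

def extract_features(script, top_n=5):
    """Return the top_n words by weight, with keywords raised to the maximum weight."""
    words = re.findall(r"[A-Za-z_]\w*", script)  # step 1: split into words
    freq = Counter(words)                        # steps 2-5: lookup table m
    max_weight = max(freq.values(), default=0)
    # Steps 6-7: traverse m and raise the weight of every keyword to the maximum.
    weights = {w: (max_weight if w in KEYWORDS else c) for w, c in freq.items()}
    # Step 8: keep the first top_n entries, highest weight first.
    ranked = sorted(weights, key=lambda w: (-weights[w], w))
    return ranked[:top_n]
```

For example, `extract_features("x = x + x; y = y; eval(z)", top_n=2)` ranks `eval` alongside the most frequent word `x`, even though `eval` occurs only once.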

For feature vector extraction for non-script-code intrusion, the training sample set is divided into two classes: a malicious file sample set based on non-script-code techniques, and a safe file sample set. A feature vector worth extracting has two properties: it occurs frequently in one class of sample set, and it occurs rarely in the other. A feature vector set that meets both properties distinguishes the two sample sets well. Based on this description, feature vectors for non-script-code intrusion are extracted as follows:

1. Extract the feature vector set T_m of the malicious sample set and compute the term frequency tf_{M,i} of each feature vector in it;

2. Extract the feature vector set T_n of the safe sample set and compute the term frequency tf_{N,j} of each feature vector in it;

3. Compute the inverse document frequency idf_{M,i} in the safe sample set of each feature vector in T_m;

4. Compute the inverse document frequency idf_{N,j} in the malicious sample set of each feature vector in T_n;

5. Select the feature vector set of each sample set separately, then merge them to obtain the feature vector set for non-script-code intrusion.
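One way to realize the five steps above is to score each feature by its term frequency in its own class multiplied by its inverse document frequency in the other class. The sketch below makes several assumptions not stated in the text: each sample is a list of feature tokens, the score is the product tf * idf, the idf is smoothed with +1 terms, and the `top_n` highest-scoring features of each class are kept.

```python
import math
from collections import Counter

def select_features(own_docs, other_docs, top_n=3):
    """Rank features by tf in own_docs times idf in other_docs; return the top_n."""
    tf = Counter(term for doc in own_docs for term in doc)
    total = sum(tf.values())
    scores = {}
    for term, count in tf.items():
        # Number of samples of the other class that contain the feature.
        df_other = sum(1 for doc in other_docs if term in doc)
        # Smoothed inverse document frequency (the smoothing is an assumption).
        idf = math.log((1 + len(other_docs)) / (1 + df_other))
        scores[term] = (count / total) * idf
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

def build_feature_set(malicious_docs, safe_docs, top_n=3):
    """Steps 1-5: select features for each class separately, then merge them."""
    t_m = select_features(malicious_docs, safe_docs, top_n)
    t_n = select_features(safe_docs, malicious_docs, top_n)
    return sorted(set(t_m) | set(t_n))
```

A feature that appears in every sample of both classes gets an idf of zero and is never selected, which matches the requirement that a useful feature be frequent in one class and rare in the other.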

When the web page file detection method of the invention classifies an unknown web page file, the script code is first extracted from the web page file, splitting the file into two parts: the embedded script code, and the remaining web page file data apart from the script code. The two parts are then detected separately: the identification model built with the Bayes algorithm detects the script code, and the identification model built with the decision tree algorithm detects the remaining data of the web page file. Finally, the result integration module processes the detection results to obtain the final detection result for the web page file. The specific flow is described below.

The feature vector set for script-code-based intrusion is classified with the simple and practical Bayes classification process. The probability P_N that the unknown web page file X belongs to the safe sample set C_n and the probability P_M that X belongs to the malicious sample set are computed, and P_N and P_M are then compared to obtain the class that X is closest to, which determines the class of X. If P_M > P_N, the web page file contains malicious script code; otherwise it does not.
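The comparison of P_M and P_N can be sketched as a naive Bayes decision. The sketch assumes conditionally independent features, compares log probabilities (which preserves the ordering of P_M and P_N and avoids underflow), and uses a small floor for features unseen in training; the `model` dictionary layout and its field names are hypothetical.

```python
import math

def log_posterior(features, prior, likelihood, floor=1e-6):
    """log P(class) plus the sum of log P(feature | class); unseen features get a floor."""
    return math.log(prior) + sum(math.log(likelihood.get(f, floor)) for f in features)

def classify_script(features, model):
    """Return 'M' (malicious) if P_M > P_N, otherwise 'N' (safe)."""
    p_m = log_posterior(features, model["prior_M"], model["like_M"])
    p_n = log_posterior(features, model["prior_N"], model["like_N"])
    return "M" if p_m > p_n else "N"
```

The shared evidence term P(X) is omitted because it is the same for both classes and does not affect the comparison.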

Before web page files are detected for non-script-code intrusion, a decision tree is built as follows, where Sample is the training sample set and Vector is the feature vector set for non-script-code intrusion.

Create the root node root of the decision tree;

If all of Sample is positive, return the single-node tree root with label +;

If all of Sample is negative, return the single-node tree root with label -;

If Vector is empty, return the single-node tree root, labeled with the most common target value in Sample;

Otherwise, for each possible value v_i of Vector:

Add a new branch under root for v_i, and let Sample_si be the subset of Sample whose Vector attribute value is v_i;

If Sample_si is empty, add a leaf node under the new branch, labeled with the most common target value in Sample;

Otherwise, add a subtree under the new branch, built recursively from:

(Sample_si, target value, Vector). End.
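The recursive construction above follows the shape of the classic ID3 procedure. The sketch below is one possible reading, not the patented implementation: it assumes that the attribute to split on is chosen by information gain (the text does not state the criterion), that samples are dicts mapping attribute names to discrete values, and that labels are '+' or '-'. Because branches are created only for values observed in the samples, the empty-subset case is handled by a per-node default label that is used for unseen values at prediction time.

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    return -sum(c / len(labels) * math.log2(c / len(labels)) for c in counts.values())

def most_common(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(samples, labels, attributes):
    """Return a label (leaf) or a nested dict node, built recursively."""
    if len(set(labels)) == 1:      # all samples positive or all negative
        return labels[0]
    if not attributes:             # no attributes left: most common label
        return most_common(labels)

    def gain(attr):
        remainder = 0.0
        for value in {s[attr] for s in samples}:
            subset = [l for s, l in zip(samples, labels) if s[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder

    best = max(attributes, key=gain)  # assumption: split by information gain
    node = {"attr": best, "default": most_common(labels), "children": {}}
    rest = [a for a in attributes if a != best]
    for value in {s[best] for s in samples}:
        pairs = [(s, l) for s, l in zip(samples, labels) if s[best] == value]
        node["children"][value] = build_tree(
            [s for s, _ in pairs], [l for _, l in pairs], rest)
    return node

def predict(tree, sample):
    """Walk from the root to a leaf; unseen values fall back to the node default."""
    while isinstance(tree, dict):
        tree = tree["children"].get(sample[tree["attr"]], tree["default"])
    return tree
```

The attribute names used when calling `build_tree` are up to the caller; any discrete feature extracted from the file blocks could serve.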

Once the decision-tree-based classification model has been built, unknown web page files can be detected with it:

1. Divide the web page file into blocks of 100 bytes to obtain file data blocks;

2. Extract the feature vector of each web page file data block;

3. Combine and select among the feature vectors of all web page file data blocks to obtain the final web page file vector set;

4. Use this feature vector set as the input to the decision tree classification model;

5. From the output of the decision tree classification model, determine whether the web page file is one attacked through non-script code.
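Steps 1 to 3 above can be illustrated with a small sketch. The 100-byte block size comes from the text; everything else is an assumption made for illustration: the per-block feature is taken to be the presence of a byte pattern, and the combination step marks a pattern as present in the file if any block contains it. Patterns that straddle a block boundary are missed by this simple scheme.

```python
BLOCK_SIZE = 100  # bytes, as stated in step 1

def split_blocks(data, size=BLOCK_SIZE):
    """Step 1: cut the file content into consecutive blocks of `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def block_features(block, patterns):
    """Step 2 (hypothetical features): 1 if the byte pattern occurs in the block."""
    return {p: int(p in block) for p in patterns}

def file_features(data, patterns):
    """Step 3: merge block features; a pattern is present if any block has it."""
    merged = {p: 0 for p in patterns}
    for block in split_blocks(data):
        for p, hit in block_features(block, patterns).items():
            merged[p] |= hit
    return merged
```

The merged dictionary is the kind of fixed-length input that a decision tree classifier can consume in step 4.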

In the implementation, the two classification models are connected in parallel, and the results of the two models must be processed. The outputs of the two identification modules serve as the input of the result integration module. According to the processing function in the result integration module, if either identification module outputs M (malicious file), the unknown web page file is a malicious file; if both identification modules output N (safe file), the unknown web page file is a safe file.
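The processing function of the result integration module is a logical OR over the two module outputs. A minimal sketch, where the function name `integrate` is hypothetical and the labels 'M' and 'N' are taken from the text:

```python
def integrate(script_result, data_result):
    """Return 'M' if either identification module reports 'M', otherwise 'N'."""
    return "M" if "M" in (script_result, data_result) else "N"
```

This rule favors detection over false-alarm avoidance: a file is reported safe only when both the Bayes module and the decision tree module agree that it is safe.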

In summary, the invention proposes a file detection and identification method that detects different intrusion methods with different classification methods and introduces deobfuscation processing to prevent malicious code from disguising itself, improving the detection success rate.

Obviously, those skilled in the art should understand that the modules or steps of the invention described above can be implemented with a general-purpose computing system. They can be concentrated in a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The invention is thus not limited to any specific combination of hardware and software.

It should be understood that the above specific embodiments of the invention serve only to illustrate or explain the principles of the invention and do not limit it. Therefore, any modification, equivalent replacement, or improvement made without departing from the spirit and scope of the invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all variations and modifications that fall within the scope and boundaries of the claims, or the equivalents of such scope and boundaries.

Claims

1. A smart device file identification method, characterized by comprising:

extracting feature vectors of the script code in a web page file and classifying the feature vectors, and identifying from the classification results whether the web page file contains malicious code;

if the result is non-script-code intrusion, creating the root node root of the decision tree before web page file detection;

if all of Sample is positive, returning the single-node tree root with label +;

if all of Sample is negative, returning the single-node tree root with label -;

if Vector is empty, returning the single-node tree root, labeled with the most common target value in Sample; wherein Sample is the training sample set and Vector is the feature vector set for non-script-code intrusion;

otherwise, for each possible value v_i of Vector:

adding a new branch under root for v_i, and letting Sample_si be the subset of Sample whose Vector attribute value is v_i;

if Sample_si is empty, adding a leaf node under the new branch, labeled with the most common target value in Sample;

otherwise, adding a subtree under the new branch, built recursively from:

(Sample_si, target value, Vector), end;

once the decision-tree-based classification model has been built, detecting unknown web page files with the classification model:

1. dividing the web page file into blocks of 100 bytes to obtain file data blocks;

2. extracting the feature vector of each web page file data block;

3. combining and selecting among the feature vectors of all web page file data blocks to obtain the final web page file vector set;

4. using this feature vector set as the input to the decision tree classification model;

5. determining, from the output of the decision tree classification model, whether the web page file is one attacked through non-script code;

in the implementation, the two classification models are connected in parallel, and the results of the two models must be processed; the outputs of the two identification modules serve as the input of the result integration module; according to the processing function in the result integration module, if either identification module outputs M, that is, malicious file, the unknown web page file is a malicious file; if both identification modules output N, that is, safe file, the unknown web page file is a safe file;

extracting the features of the code further comprises:

first extracting the script code from the web page file, then performing feature extraction with the word as the unit, performing feature selection on the extracted feature vectors, and increasing the weight of key feature vectors; in the web page file, the entry position of the script code is located by keyword; wherein the extraction of the script code specifically comprises:

1. opening the web page file;

2. initializing the internal data structures;

3. searching the catalog directory and finding the address of the active dictionary entry;

4. searching the candidate positions that may contain script code, and checking the data type of each dictionary entry;

5. if the data type is an element of the predefined keyword set, the dictionary contains script code, and the script code is extracted;

6. decompressing the script code;

decoding the encoded script code stream: determining whether the characters in the stream are encoded, that is, whether the header of the code stream contains an encoding-mode field; if so, calling the decoding function to decode it; finally, saving the result;

the method further comprises: before the feature vector extraction, preprocessing the web page file; in the first step, the executable script code in the web page file is located and extracted; in the second step, the extracted script code is decoded and deobfuscated, finally yielding the original script code;

after the preceding data preprocessing, the script code is the original script code, and the feature vectors are extracted as follows:

1. splitting the script code into a string sequence s, with the word as the unit;

2. creating a word-frequency lookup table m;

3. traversing s and checking whether each word w is in m; if so, going to step 4, otherwise going to step 5;

4. increasing the word frequency m[w] of word w in the lookup table by 1;

5. setting the word frequency of word w in the lookup table to m[w] = 1;

6. traversing m with traversal pointer ptr;

7. if the entry that ptr points to is a keyword, increasing its feature weight to the maximum value;

8. selecting the top five feature vectors as the final feature vector set.

2. The method according to claim 1, characterized in that classifying the feature vectors further comprises:

dividing the web page file into two parts, one being the embedded script code and the other being the remaining web page file data apart from the script code, and then detecting the two parts separately; an identification model built with the Bayes algorithm detects the script code, and the specific identification process comprises: computing the probability P_N that the unknown web page file X belongs to the safe sample set C_n and the probability P_M that X belongs to the malicious sample set, then comparing P_N and P_M to obtain the class that X is closest to, thereby determining the class of X; if P_M > P_N, the web page file contains malicious script code; otherwise it does not;

an identification model built with the decision tree algorithm detects the remaining web page file data apart from the script code; finally, the detection results are merged to obtain the final identification result; if either of the two parts is identified as a malicious file, the unknown web page file is identified as a malicious file; if both parts are identified as safe files, the unknown web page file is a safe file.