CN105095756A - Method and device for detecting portable document format document

Info

Publication number: CN105095756A
Application number: CN201510391902.7A
Authority: CN
Inventors: 苟孟洛
Original assignee: Beijing Kingsoft Internet Security Software Co Ltd
Current assignee: Beijing Kingsoft Internet Security Software Co Ltd
Priority date: 2015-07-06
Filing date: 2015-07-06
Publication date: 2015-11-25

Abstract

The invention provides a method and a device for detecting a Portable Document Format (PDF) document. The method comprises the following steps: extracting feature values from the file structure of a training PDF document, wherein the training PDF document comprises a malicious PDF document containing attack code; learning the feature values through a machine learning algorithm to generate a detection model; and predicting, through the detection model, whether a PDF document to be detected is a malicious PDF document. The invention predicts whether a document is malicious on the basis of static analysis alone, thereby improving the security of PDF documents.

Description

Method and device for detecting Portable Document Format documents

Technical field

The present invention relates to the field of information security technology, and in particular to a method and a device for detecting Portable Document Format (hereinafter: PDF) documents.

Background

With the rapid development of the Internet and the growing prevalence of office automation, the PDF document has become an open standard for electronic document distribution worldwide. Owing to its practicality and broad compatibility, the PDF document has also become an effective carrier for targeted phishing attacks. Malicious code can do serious damage to a computer, so detecting PDF documents that contain malicious code has become an important goal in the field of computer security.

However, existing detection methods cannot effectively detect whether a PDF document is harmful, which leaves PDF documents poorly secured.

Summary of the invention

The present invention aims to solve, at least to some extent, one of the technical problems in the related art.

To this end, a first object of the present invention is to propose a method for detecting Portable Document Format (PDF) documents. The method can predict whether a PDF document is malicious on the basis of static parsing alone, improving the security of PDF documents.

A second object of the present invention is to propose a device for detecting Portable Document Format (PDF) documents.

To achieve the above objects, the method for detecting a PDF document according to an embodiment of the first aspect of the present invention comprises: extracting feature values from the file structure of a training PDF document, the training PDF document comprising a malicious PDF document containing attack code; learning the feature values through a machine learning algorithm to generate a detection model; and predicting, through the detection model, whether a PDF document to be detected is a malicious PDF document.

In the detection method of the embodiment of the present invention, feature values are extracted from the file structure of the training PDF document, the feature values are learned through a machine learning algorithm to generate a detection model, and the detection model is then used to predict whether a PDF document to be detected is a malicious PDF document. Because the feature values are extracted entirely during static parsing, with no dynamic analysis involved, the prediction is made on the basis of static parsing alone, which in turn can improve the security of PDF documents.

To achieve the above objects, the device for detecting a PDF document according to an embodiment of the second aspect of the present invention comprises: an extraction module, configured to extract feature values from the file structure of a training PDF document, the training PDF document comprising a malicious PDF document containing attack code; a generation module, configured to learn the feature values extracted by the extraction module through a machine learning algorithm to generate a detection model; and a detection module, configured to predict, through the detection model generated by the generation module, whether a PDF document to be detected is a malicious PDF document.

In the detection device of the embodiment of the present invention, the extraction module extracts feature values from the file structure of the training PDF document, the generation module learns the feature values through a machine learning algorithm to generate a detection model, and the detection module then uses the detection model to predict whether a PDF document to be detected is a malicious PDF document. Because the extraction module extracts the feature values entirely during static parsing, with no dynamic analysis involved, the prediction is made on the basis of static parsing alone, which in turn can improve the security of PDF documents.

Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a flowchart of one embodiment of the PDF document detection method of the present invention;

Fig. 2 is a schematic diagram of another embodiment of the PDF document detection method of the present invention;

Fig. 3 is a schematic diagram of one embodiment of the training PDF documents of the present invention;

Fig. 4 is a schematic diagram of one embodiment of the prediction result of the detection model of the present invention;

Fig. 5 is a structural diagram of one embodiment of the PDF document detection device of the present invention;

Fig. 6 is a structural diagram of another embodiment of the PDF document detection device of the present invention.

Detailed description

Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present invention and are not to be construed as limiting it. On the contrary, the embodiments of the present invention cover all changes, modifications and equivalents falling within the spirit and scope of the appended claims.

Fig. 1 is a flowchart of one embodiment of the PDF document detection method of the present invention. As shown in Fig. 1, the detection method may include:

Step 101: extract feature values from the file structure of a training PDF document.

The training PDF document comprises a malicious PDF document containing attack code.

Specifically, the feature values can be extracted from the file structure of the training PDF document with the PdfStreamDumper tool, or the extraction can be automated by writing code.

In this embodiment, the feature values may include: metadata (/Metadata), file operation behavior (for example, "/OpenAction") and page count (/Pages/Count). The embodiments of the present invention are of course not limited to these; the invention places no restriction on the feature values, provided they are feature values contained in the file structure of a PDF document. For example, the feature values may also include "/JavaScript" and so on, which are not enumerated here.
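As an illustration only, automated extraction by writing code might be sketched in Python as follows. The function name `extract_features` and the regular expressions are assumptions made for this sketch and are not part of the embodiment. The second feature counts "/JS" entries, following the "/OpenAction/JS" feature described later in this description. The sketch scans raw bytes and does not decompress object streams or resolve indirect references, so it is cruder than a real PDF parser.

```python
import re


def extract_features(pdf_bytes: bytes) -> tuple:
    """Return (has_metadata, js_count, page_count) from raw PDF bytes.

    Hypothetical helper: a plain byte scan, not a full PDF parser.
    """
    # 1 if a /Metadata key appears anywhere in the file, otherwise 0.
    has_metadata = 1 if re.search(rb"/Metadata\b", pdf_bytes) else 0
    # Number of /JS entries (JavaScript payloads attached to actions).
    js_count = len(re.findall(rb"/JS\b", pdf_bytes))
    # Page count: the largest /Count value found, or 0 if none is present.
    counts = [int(m) for m in re.findall(rb"/Count\s+(\d+)", pdf_bytes)]
    page_count = max(counts) if counts else 0
    return (has_metadata, js_count, page_count)


if __name__ == "__main__":
    sample = (
        b"%PDF-1.4\n"
        b"1 0 obj << /Type /Catalog /Pages 2 0 R "
        b"/OpenAction << /S /JavaScript /JS (app.alert(1)) >> >> endobj\n"
        b"2 0 obj << /Type /Pages /Count 1 /Kids [3 0 R] >> endobj\n"
    )
    print(extract_features(sample))
```

Because the scan is purely textual, it stays within static parsing: nothing in the document is executed or rendered.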

Step 102: learn the feature values through a machine learning algorithm to generate a detection model.

Specifically, the training PDF document comprises at least two training PDF documents. Learning the feature values through a machine learning algorithm to generate a detection model may consist of: generating training data from the feature values extracted from the file structures of the at least two training PDF documents; and normalizing the training data and learning it through the machine learning algorithm to generate the detection model.
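The normalization step is not specified further in this description. A minimal Python sketch, assuming a per-feature linear (min-max) scaling into [-1, 1], could look as follows. The function names `fit_ranges` and `scale_row`, the target range, and the choice to map a constant feature to 0.0 are all assumptions of this sketch.

```python
def fit_ranges(rows):
    """Return one (min, max) pair per feature column of the training rows."""
    columns = list(zip(*rows))
    return [(min(column), max(column)) for column in columns]


def scale_row(row, ranges, lower=-1.0, upper=1.0):
    """Linearly map each feature of `row` into [lower, upper]."""
    scaled = []
    for value, (lo, hi) in zip(row, ranges):
        if hi == lo:
            # A constant feature carries no information; assumed to map to 0.0.
            scaled.append(0.0)
        else:
            scaled.append(lower + (upper - lower) * (value - lo) / (hi - lo))
    return scaled


if __name__ == "__main__":
    # Illustrative rows only: (metadata flag, JavaScript count, page count).
    training_rows = [(1, 0, 110), (0, 1, 1)]
    ranges = fit_ranges(training_rows)
    for row in training_rows:
        print(scale_row(row, ranges))
```

The ranges are fitted on the training data only and then reused for the document to be detected, so that both are scaled consistently.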

In this embodiment, the core of the machine learning algorithm is implemented with the Libsvm software. Libsvm is a simple, easy-to-use, fast and effective software package for support vector machine (Support Vector Machine, hereinafter: SVM) pattern recognition and regression; an SVM is a classifier that performs two-class classification.

Step 103: predict, through the detection model, whether a PDF document to be detected is a malicious PDF document.

Specifically, step 103 may consist of: obtaining, through the detection model, the percentage with which the PDF document to be detected belongs to the class of malicious PDF documents; when that percentage falls within a preset interval, determining that the PDF document to be detected is a malicious PDF document; and when that percentage does not fall within the preset interval, determining that the PDF document to be detected is not a malicious PDF document.

The preset interval can be set at implementation time according to implementation requirements and/or system performance; this embodiment places no restriction on the size of the preset interval.
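The interval test of step 103 can be sketched in Python as follows. The function name `is_malicious` is hypothetical, and the default interval [70%, ∞) is borrowed from the worked example later in this description; it is not a prescribed value.

```python
import math


def is_malicious(percent, interval=(70.0, math.inf)):
    """Return True when the malicious-class percentage lies in the interval.

    `interval` is a (low, high) pair; the default is an assumption taken
    from the worked example, not a required setting.
    """
    low, high = interval
    return low <= percent <= high


if __name__ == "__main__":
    print(is_malicious(100.0))  # inside [70, inf)
    print(is_malicious(0.0))    # outside [70, inf)
```

A stricter or looser interval can be passed in, for example `is_malicious(85.0, (90.0, math.inf))`, reflecting that the interval is left to the implementer.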

In the above embodiment, feature values are extracted from the file structure of the training PDF document, the feature values are learned through a machine learning algorithm to generate a detection model, and the detection model is then used to predict whether a PDF document to be detected is a malicious PDF document. Because the feature values are extracted entirely during static parsing, with no dynamic analysis involved, the prediction is made on the basis of static parsing alone, which in turn can improve the security of PDF documents.

Fig. 2 is a schematic diagram of another embodiment of the PDF document detection method of the present invention. In Fig. 2, the training PDF documents (TrainingPDFFile) and the PDF document to be detected (TestPDFFile) are all PDF documents collected at random from external sources, and TrainingPDFFile comprises malicious PDF documents containing attack code. After feature extraction, the feature values extracted from TrainingPDFFile are learned through the machine learning algorithm, a process that includes parameter training, data normalization and so on, and a detection model (Model) is finally generated. This detection model is then used to predict whether TestPDFFile is a malicious PDF document.

The core of the machine learning algorithm is implemented with the Libsvm software. Libsvm is a simple, easy-to-use, fast and effective software package for SVM pattern recognition and regression; an SVM is a classifier that performs two-class classification.

To verify the feasibility of the detection model, a number of PDF documents can be collected at random from the network as TrainingPDFFile. TrainingPDFFile comprises malicious PDF documents containing attack code and may also contain normal PDF documents. The malicious documents are exploit files that carry malicious code and target vulnerabilities published with Common Vulnerabilities and Exposures (hereinafter: CVE) numbers. Fig. 3 is a schematic diagram of one embodiment of the training PDF documents of the present invention.

Three feature values that distinguish malicious documents from normal documents comparatively well are chosen from these samples, as follows:

(1) Metadata (/Metadata): the result is "1" if the PDF document has metadata and "0" if it does not. Malicious PDF documents generally omit metadata in order to reduce file size, whereas normal PDF documents generally include it.

(2) Open action (/OpenAction/JS): malicious PDF documents generally contain JavaScript code; the result given here is the number of pieces of JavaScript code.

(3) Page count (/Pages/Count): a malicious document generally has one page. When a malicious document is opened it does not jump to any particular page of the document, so the two feature values "/TYPE" and "/Pages/count" may not be found.

The feature values of TrainingPDFFile are then extracted and the training data are generated in turn as follows (label 1 denotes a normal document, label 2 a malicious document):

1 1:1 2:0 3:110

1 1:1 2:0 3:363

1 1:0 2:0 3:23

1 1:0 2:0 3:26

1 1:0 2:0 3:7

2 1:0 2:1 3:1

2 1:1 2:1 3:1

Test data are obtained from TestPDFFile in the same way. The TestPDFFile chosen is the known malicious PDF document CVE2009-0027, which is provisionally labeled as a normal document (1), giving the following test data:

1 1:0 2:1 3:1
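Libsvm reads one sample per line, written as a class label followed by space-separated index:value pairs with indices starting at 1. As a minimal sketch, the conversion between a feature tuple and such a line might be written in Python as below. The function names `format_sample` and `parse_sample` are hypothetical, and the feature order (metadata, JavaScript count, page count) follows the three feature values listed above.

```python
def format_sample(label, features):
    """Render a label and a feature sequence as one Libsvm-format line."""
    pairs = " ".join(
        f"{index}:{value}" for index, value in enumerate(features, start=1)
    )
    return f"{label} {pairs}"


def parse_sample(line):
    """Parse one Libsvm-format line into (label, {index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return int(label), features


if __name__ == "__main__":
    # A malicious-labeled sample: no metadata, one script, one page.
    line = format_sample(2, (0, 1, 1))
    print(line)
    print(parse_sample(line))
```

Lines produced this way can be written to the training and test files that are passed to Libsvm.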

These data are then normalized, the best parameters are searched for, the model is trained, and the generated model is used for prediction; all of these operations are performed in Libsvm. For ease of testing, default parameters are used throughout, and the easy.py command can be used to make a simple prediction. The prediction result obtained is shown in Fig. 4, which is a schematic diagram of one embodiment of the prediction result of the detection model of the present invention.

The file train.txt holds the feature values extracted from TrainingPDFFile, the file test.txt holds the feature values extracted from TestPDFFile, and the output file test.txt.predict records the detailed prediction results and the corresponding accuracy. As can be seen from Fig. 4, the percentage with which TestPDFFile belongs to the class of normal PDF documents is 0%; that is, the percentage with which TestPDFFile belongs to the class of malicious PDF documents is 100%. Assuming the preset interval is [70%, ∞), the percentage with which TestPDFFile belongs to the class of malicious PDF documents falls within the preset interval, so TestPDFFile (CVE2009-0027) can be determined to be a malicious document, which verifies the validity of the detection model proposed by the present invention.

The present invention provides a method for detecting PDF documents and verifies its feasibility. Because the detection model is generated from the features of known samples and is used to predict unknown samples, it will need to be improved gradually as hacking techniques develop and new malicious PDF attack methods continue to emerge. For the present, however, the detection model remains highly effective and viable: it is only necessary to collect enough PDF files from external sources and extract their feature values. The more samples are collected and the more feature values are extracted, the more accurate the predictions of the trained detection model will be.

Fig. 5 is a structural diagram of one embodiment of the PDF document detection device of the present invention. The detection device in this embodiment can carry out the flow of the embodiment shown in Fig. 1. As shown in Fig. 5, the detection device may include an extraction module 51, a generation module 52 and a detection module 53.

The extraction module 51 is configured to extract feature values from the file structure of a training PDF document, the training PDF document comprising a malicious PDF document containing attack code. Specifically, the extraction module 51 can extract the feature values from the file structure of the training PDF document with the PdfStreamDumper tool, or the extraction can be automated by writing code.

The generation module 52 is configured to learn the feature values extracted by the extraction module 51 through a machine learning algorithm to generate a detection model. In this embodiment, the core of the machine learning algorithm is implemented with the Libsvm software. Libsvm is a simple, easy-to-use, fast and effective software package for SVM pattern recognition and regression; an SVM is a classifier that performs two-class classification.

The detection module 53 is configured to predict, through the detection model generated by the generation module 52, whether a PDF document to be detected is a malicious PDF document.

In the above detection device, the extraction module 51 extracts feature values from the file structure of the training PDF document, the generation module 52 learns the feature values through a machine learning algorithm to generate a detection model, and the detection module 53 then uses the detection model to predict whether a PDF document to be detected is a malicious PDF document. Because the extraction module 51 extracts the feature values entirely during static parsing, with no dynamic analysis involved, the prediction is made on the basis of static parsing alone, which in turn can improve the security of PDF documents.

Fig. 6 is a structural diagram of another embodiment of the PDF document detection device of the present invention. Compared with the device shown in Fig. 5, the difference is that, in the embodiment shown in Fig. 6, the training PDF document comprises at least two training PDF documents, and the generation module 52 may include a data generation submodule 521 and a model generation submodule 522.

The data generation submodule 521 is configured to generate training data from the feature values that the extraction module 51 extracts from the file structures of the at least two training PDF documents.

The model generation submodule 522 is configured to normalize the training data generated by the data generation submodule 521 and to learn it through the machine learning algorithm to generate the detection model.

In this embodiment, the detection module 53 may include an obtaining submodule 531 and a determining submodule 532.

The obtaining submodule 531 is configured to obtain, through the detection model, the percentage with which the PDF document to be detected belongs to the class of malicious PDF documents.

The determining submodule 532 is configured to determine that the PDF document to be detected is a malicious PDF document when the percentage obtained by the obtaining submodule 531 falls within a preset interval, and to determine that the PDF document to be detected is not a malicious PDF document when the percentage obtained by the obtaining submodule 531 does not fall within the preset interval.

The above detection device predicts whether a document is malicious on the basis of static parsing alone, which in turn can improve the security of PDF documents.

It should be noted that, in the description of the present invention, unless otherwise stated, "multiple" means two or more.

Any process or method described in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code comprising one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order, depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention pertain.

It should be understood that the parts of the present invention may be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques well known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (Programmable Gate Array, hereinafter: PGA), a field programmable gate array (Field Programmable Gate Array, hereinafter: FPGA), and so on.

Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.

In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, may each exist physically on their own, or two or more of them may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

In this specification, a description referring to the terms "an embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that the specific features, structures, materials or characteristics described in connection with that embodiment or example are included in at least one embodiment or example of the present invention. Such terms, as used in this specification, do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is to be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention. Those of ordinary skill in the art may make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention.

Claims

1. A method for detecting a Portable Document Format (PDF) document, characterized by comprising:

extracting feature values from the file structure of a training PDF document, the training PDF document comprising a malicious PDF document containing attack code;

learning the feature values through a machine learning algorithm to generate a detection model; and

predicting, through the detection model, whether a PDF document to be detected is a malicious PDF document.

2. The method according to claim 1, characterized in that the training PDF document comprises at least two training PDF documents, and learning the feature values through a machine learning algorithm to generate a detection model comprises:

generating training data from the feature values extracted from the file structures of the at least two training PDF documents; and

normalizing the training data and learning it through the machine learning algorithm to generate the detection model.

3. The method according to claim 1, characterized in that predicting, through the detection model, whether a PDF document to be detected is a malicious PDF document comprises:

obtaining, through the detection model, the percentage with which the PDF document to be detected belongs to the class of malicious PDF documents;

when the percentage with which the PDF document to be detected belongs to the class of malicious PDF documents falls within a preset interval, determining that the PDF document to be detected is a malicious PDF document; and

when the percentage with which the PDF document to be detected belongs to the class of malicious PDF documents does not fall within the preset interval, determining that the PDF document to be detected is not a malicious PDF document.

4. The method according to any one of claims 1 to 3, characterized in that the feature values comprise: metadata, file operation behavior and page count.

5. A device for detecting a Portable Document Format (PDF) document, characterized by comprising:

an extraction module, configured to extract feature values from the file structure of a training PDF document, the training PDF document comprising a malicious PDF document containing attack code;

a generation module, configured to learn the feature values extracted by the extraction module through a machine learning algorithm to generate a detection model; and

a detection module, configured to predict, through the detection model generated by the generation module, whether a PDF document to be detected is a malicious PDF document.

6. The device according to claim 5, characterized in that the training PDF document comprises at least two training PDF documents, and the generation module comprises a data generation submodule and a model generation submodule;

the data generation submodule is configured to generate training data from the feature values that the extraction module extracts from the file structures of the at least two training PDF documents; and

the model generation submodule is configured to normalize the training data generated by the data generation submodule and to learn it through the machine learning algorithm to generate the detection model.

7. The device according to claim 5, characterized in that the detection module comprises:

an obtaining submodule, configured to obtain, through the detection model, the percentage with which the PDF document to be detected belongs to the class of malicious PDF documents; and

a determining submodule, configured to determine that the PDF document to be detected is a malicious PDF document when the percentage obtained by the obtaining submodule falls within a preset interval, and to determine that the PDF document to be detected is not a malicious PDF document when the percentage obtained by the obtaining submodule does not fall within the preset interval.

8. The device according to any one of claims 5 to 7, characterized in that the feature values extracted by the extraction module comprise: metadata, file operation behavior and page count.