CN114510716A

CN114510716A - Document detection method, model training method, device, terminal and storage medium

Info

Publication number: CN114510716A
Application number: CN202210066690.5A
Authority: CN
Inventors: 徐钟豪; 陈伟; 谢忱; 刘伟
Original assignee: Shanghai Douxiang Information Technology Co ltd
Current assignee: Shanghai Douxiang Information Technology Co ltd
Priority date: 2022-01-20
Filing date: 2022-01-20
Publication date: 2022-05-17

Abstract

The application provides a document detection method, a model training method, a device, a terminal and a storage medium. The scheme obtains the macro code of a document to be detected, extracts feature information corresponding to preset features from the macro code, and then detects the document directly from that feature information using a preset malicious document detection model. The document to be detected does not need to be placed in a sandbox for detection, so no sandbox environment needs to be deployed; the cost is lower, detection is more convenient, and the detection speed is higher.

Description

Document detection method, model training method, device, terminal and storage medium

Technical Field

The application relates to the technical field of internet, in particular to a document detection method, a model training method, a device, a terminal and a storage medium.

Background

A malicious Office document is an Office document that steals sensitive information, or monitors and disrupts the normal activities of the user, by embedding and executing malicious code or by exploiting its structural characteristics. Analysis of APT (Advanced Persistent Threat) events in recent years shows that malicious Office documents have become the attack decoys most frequently used by lawbreakers. Office software is the most commonly used office software on people's computers, and malicious Office document attacks not only bring huge risks to individuals and enterprises, but also seriously threaten national security. After successfully attacking a computer, an attacker generally uses that computer as a springboard to move laterally, attack other computers or facilities in the network, steal confidential data, or destroy important infrastructure. Therefore, it is necessary to detect documents so as to distinguish malicious documents from normal documents.

At present, malicious documents and normal documents are mainly distinguished by sandbox-based detection. This method requires the document to be detected to be placed in a sandbox for dynamic execution, after which the execution results are analyzed. Such detection is not convenient enough, is slow, and incurs a high cost for deploying the sandbox.

Disclosure of Invention

An embodiment of the present application aims to provide a document detection method, a model training method, an apparatus, a terminal, and a storage medium, so as to solve the problems that detection of malicious documents by sandbox is not convenient enough, detection speed is slow, and cost is high in the prior art.

The embodiment of the application provides a document detection method, which comprises the following steps:

acquiring a macro code of a document to be detected;

extracting feature information corresponding to preset features from the macro code; the preset features comprise at least one of statistical features and training features, the statistical features are features used for representing statistical data in the macro code, the training features are features determined according to preset training sample documents, and the preset training sample documents comprise training sample documents provided with malicious class labels and training sample documents provided with normal class labels;

and inputting the characteristic information into a preset malicious document detection model to obtain a detection result of the document to be detected.

In the implementation process, the detection of the document to be detected is directly carried out according to the preset malicious document detection model, and the document to be detected does not need to be put into a sandbox for detection, so that the sandbox environment does not need to be deployed, the cost is lower, the detection is more convenient, and the detection speed is faster.

Further, when the preset features include the training features, the method further includes the step of determining the training features according to the preset training sample documents:

and performing feature screening on the preset training sample document based on at least one of a bag-of-words model and a TF-IDF model to obtain the training features.

In the implementation process, the training features are obtained by feature screening based on at least one of the bag-of-words model and the TF-IDF model. Because the features have been screened, they are more representative and the detection result is more accurate.

Further, the performing feature screening on the preset training sample document based on at least one of a bag-of-words model and a TF-IDF model to obtain the training features includes:

carrying out feature screening on the preset training sample document based on a bag-of-words model to obtain bag-of-words features, carrying out feature screening on the preset training sample document based on a TF-IDF model to obtain TF-IDF features, and solving a union set or an intersection set of the bag-of-words features and the TF-IDF features to obtain the training features;

or, alternatively,

and performing feature screening on the preset training sample document based on a bag-of-words model to obtain bag-of-words features, and filtering the bag-of-words features based on a TF-IDF model to obtain the training features.

In the implementation process, the bag-of-words model and the TF-IDF model are used for feature screening, namely training features are obtained based on the occurrence frequency and the recognition degree of the features, so that the obtained training features are higher in representativeness, and the accuracy of the detection result is further improved.

Further, the method further comprises:

when the detection result indicates that the type of the document to be detected is a malicious document, directly determining the document to be detected as the malicious document;

and when the detection result indicates that the type of the document to be detected is a normal document, comparing the macro code with a preset key code, and determining whether the type of the document to be detected is changed from the normal document to a malicious document according to the comparison result.

In the implementation process, when the detection result of the malicious document detection model indicates that the type of the document to be detected is a normal document, key code detection is performed to determine whether the type of the document to be detected should be changed from a normal document to a malicious document. A first round of detection is performed based on the model, and a further second round of detection is then performed from the dimension of key codes, which improves the accuracy of the detection result.

Further, the preset key code comprises at least two of a first type code corresponding to a self-starting behavior, a second type code corresponding to a writing behavior, and a third type code corresponding to an executing payload behavior.

In the implementation process, the corresponding key codes are preset based on the representative behavior of the malicious document, so that the accuracy of the detection result is ensured as much as possible.

Further, the determining whether to change the type of the document to be detected from a normal document to a malicious document according to the comparison result includes:

and when at least two types of codes matched with the preset key codes exist in the macro codes, changing the type of the document to be detected from a normal document into a malicious document.

In the implementation process, whether the macro code of the document to be detected contains the corresponding behavior is determined by searching the code matched with the key code, so as to further determine whether the document to be detected is a malicious document.

The embodiment of the application further provides a malicious document detection model training method, which comprises the following steps:

acquiring a training sample document provided with a malicious class label and a training sample document provided with a normal class label;

acquiring a macro code of each training sample document;

extracting feature information corresponding to preset features from each macro code, wherein the preset features comprise at least one of statistical features and training features, the statistical features are features used for representing statistical data in the macro codes, and the training features are features determined according to the training sample documents;

and performing model training according to the characteristic information of each training sample document to obtain a malicious document detection model.

An embodiment of the present application further provides a document detecting apparatus, including:

the acquisition module is used for acquiring the macro code of the document to be detected;

the extraction module is used for extracting feature information corresponding to preset features from the macro code; the preset features comprise at least one of statistical features and training features, the statistical features are features used for representing statistical data in the macro code, the training features are features determined according to preset training sample documents, and the preset training sample documents comprise training sample documents provided with malicious class labels and training sample documents provided with normal class labels;

and the detection module is used for inputting the characteristic information into a preset malicious document detection model to obtain a detection result of the document to be detected.

An embodiment of the present application further provides a terminal, which includes a processor and a memory, where the memory stores a computer program, and the processor executes the computer program to implement any one of the above methods.

An embodiment of the present application further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by at least one processor, the computer program implements any one of the above methods.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.

FIG. 1 is a schematic flowchart of a document detection method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of code corresponding to a self-starting behavior according to an embodiment of the present application;

FIG. 3 is a schematic diagram of code corresponding to a writing behavior according to an embodiment of the present application;

FIG. 4 is a schematic diagram of code corresponding to an executing payload behavior according to an embodiment of the present application;

FIG. 5 is a schematic flowchart of a malicious document detection model training method according to a second embodiment of the present application;

FIG. 6 is a schematic structural diagram of a document detecting apparatus according to a third embodiment of the present application;

FIG. 7 is a schematic structural diagram of a terminal according to a fourth embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the descriptions relating to "first", "second", etc. in the embodiments of the present invention are only for descriptive purposes and are not to be construed as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present invention.

In the description of the present invention, it should be understood that the numerical references before the steps do not identify the order of performing the steps, but merely serve to facilitate the description of the present invention and to distinguish each step, and thus should not be construed as limiting the present invention.

Various embodiments will be provided to specifically describe a document detection method, a model training method, an apparatus, a terminal and a storage medium.

Embodiment one:

To solve the problems in the prior art that detecting malicious documents is not convenient enough, is slow, and is costly, an embodiment of the present application provides a document detection method. Referring to fig. 1, the method includes:

S101: acquiring the macro code of the document to be detected.

The document to be detected in step S101 may be any document containing macro codes, for example, an Office document, and more specifically, the document to be detected includes, but is not limited to, a Word document, an Excel document, and a PPT document.
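As an illustrative, non-limiting sketch of the first part of step S101, the snippet below locates the VBA project container inside a macro-enabled OOXML document, which is a zip archive that stores its macros in a part named `vbaProject.bin`. The function name `find_vba_parts` is hypothetical. The sketch only finds the container; decoding the compressed VBA source from it, or handling legacy OLE files such as .doc and .xls, requires a dedicated parser and is outside this sketch.

```python
import io
import zipfile


def find_vba_parts(data):
    """Return the names of vbaProject.bin parts inside an OOXML (zip) document.

    `data` is the raw bytes of the document. An empty list means either that
    the file is not a zip-based Office document or that it carries no macros.
    """
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return []  # legacy OLE formats are not handled by this sketch
    with zipfile.ZipFile(buffer) as archive:
        return [name for name in archive.namelist()
                if name.lower().endswith("vbaproject.bin")]
```

A document for which the returned list is empty has no macro code to extract by this route, so the later feature extraction steps would have nothing to operate on.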

S102: extracting feature information corresponding to preset features from the macro code, wherein the preset features comprise at least one of statistical features and training features.

It is understood that, in other embodiments, the preset features in step S102 may be set arbitrarily by a developer; for example, features that are representative in distinguishing malicious documents from normal documents may be selected as the preset features. The preset features in this embodiment include at least one of a statistical feature and a training feature.

The statistical characteristic is a characteristic used for representing statistical data in the macro code, which may be a characteristic representing any statistical data, and specifically, please refer to table 1 below, where the statistical characteristic in this embodiment includes, but is not limited to, at least one of the following 10 characteristics:

TABLE 1 statistical characteristics

Statistical feature name	Meaning
vba_avg_param_per_func	Average number of parameters per method
vba_cnt_comment_loc_ratio	Ratio of comment lines to all lines
vba_cnt_comments	Number of comment lines
vba_cnt_func_loc_ratio	Ratio of the number of methods to the number of non-empty lines
vba_cnt_functions	Number of methods
vba_cnt_loc	Number of non-empty lines
vba_entropy_chars	Entropy of all letters
vba_entropy_func_names	Entropy of parameters in the overall method
vba_entropy_words	Entropy of all words
vba_mean_loc_per_func	Average number of lines occupied by each method
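As an illustrative, non-limiting sketch, several of the statistical features in table 1 could be computed from macro source text as follows. The text does not specify how methods and comments are recognised or which entropy is used, so the regular expression `FUNC_RE`, the rule that a comment line starts with an apostrophe, and the use of Shannon entropy in bits are all assumptions of this sketch. The features vba_entropy_func_names and vba_mean_loc_per_func are omitted.

```python
import math
import re
from collections import Counter

# Assumed pattern for a VBA procedure header: optional Public/Private,
# then Sub or Function, a name, and a parenthesised parameter list.
FUNC_RE = re.compile(
    r"^[ \t]*(?:(?:Public|Private)[ \t]+)?(?:Sub|Function)[ \t]+(\w+)[ \t]*\(([^)]*)\)",
    re.IGNORECASE | re.MULTILINE,
)


def shannon_entropy(items):
    """Shannon entropy, in bits, of the frequency distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def statistical_features(macro_code):
    """Compute a subset of the table 1 features for one macro."""
    lines = [ln for ln in macro_code.splitlines() if ln.strip()]
    comments = [ln for ln in lines if ln.strip().startswith("'")]
    funcs = FUNC_RE.findall(macro_code)
    loc = len(lines)
    n_funcs = len(funcs)
    n_params = sum(
        len([p for p in params.split(",") if p.strip()]) for _, params in funcs
    )
    return {
        "vba_cnt_loc": loc,
        "vba_cnt_comments": len(comments),
        "vba_cnt_comment_loc_ratio": len(comments) / loc if loc else 0.0,
        "vba_cnt_functions": n_funcs,
        "vba_cnt_func_loc_ratio": n_funcs / loc if loc else 0.0,
        "vba_avg_param_per_func": n_params / n_funcs if n_funcs else 0.0,
        "vba_entropy_chars": shannon_entropy(c for c in macro_code if c.isalpha()),
        "vba_entropy_words": shannon_entropy(re.findall(r"\w+", macro_code)),
    }
```

The resulting dictionary can be flattened into a fixed-order numeric vector before it is passed to a detection model.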

The training features in step S102 are features determined according to preset training sample documents, and the preset training sample documents include training sample documents provided with malicious class labels and training sample documents provided with normal class labels.

When the preset features include training features, the document detection method provided in this embodiment may further include a step of determining the training features according to the preset training sample documents:

and performing feature screening on preset training sample documents based on at least one of a bag-of-words model and a TF-IDF model to obtain the training features.

The bag-of-words model performs word segmentation on the macro codes in the training sample documents (i.e., splits them into features), counts the number of times each word appears in the documents after segmentation (i.e., the number of times each feature appears), and screens the training features according to these counts. In this embodiment, when the bag-of-words model performs feature screening by counting word frequency, the word frequency statistics can be obtained using the CountVectorizer class of sklearn.
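As an illustrative, non-limiting sketch, the counting and screening performed by a bag-of-words model can be written with the Python standard library alone. This is a stand-in for a library vectorizer, not the implementation of this embodiment. The tokenisation rule (lower-cased runs of at least two word characters) and the `top_k` cut-off are assumptions of the sketch.

```python
import re
from collections import Counter


def tokenize(macro_code):
    """Split macro source into lower-cased tokens of two or more word characters."""
    return re.findall(r"\w\w+", macro_code.lower())


def bag_of_words_features(macros, top_k):
    """Return the top_k most frequent tokens across all training macros."""
    totals = Counter()
    for code in macros:
        totals.update(tokenize(code))
    # Sort by descending count, then alphabetically, so that ties are deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:top_k]]


def vectorize(macro_code, features):
    """Feature information for one macro: the count of each selected token."""
    counts = Counter(tokenize(macro_code))
    return [counts[f] for f in features]
```

For example, `vectorize(code, bag_of_words_features(training_macros, 100))` yields a 100-dimensional count vector for `code`, where `training_macros` is a hypothetical list of macro source strings.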

The TF-IDF model evaluates how important a word is to a document within a set of documents or a corpus. The importance of a word increases in proportion to the number of times it appears in a document, but decreases in inverse proportion to the frequency with which it appears in the corpus. In this embodiment, the importance of each feature is evaluated based on the TF-IDF model; the higher the importance, the more representative the feature, and such features can be screened out as preset features for distinguishing malicious documents from normal documents. The TF-IDF model in this embodiment can perform feature screening using the TfidfTransformer class of sklearn.
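As an illustrative, non-limiting sketch, the TF-IDF weighting described above can be computed with the textbook formula tf(t, d) * log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing token t. Library implementations typically apply smoothing and normalisation, so their numbers differ from this sketch. Ranking tokens by their mean TF-IDF over the corpus, and the `top_k` cut-off, are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-cased tokens of two or more word characters."""
    return re.findall(r"\w\w+", text.lower())


def tfidf_scores(macros):
    """Mean TF-IDF of each token over the corpus (unsmoothed textbook form)."""
    docs = [Counter(tokenize(m)) for m in macros]
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())
    scores = Counter()
    for doc in docs:
        length = sum(doc.values())
        for token, count in doc.items():
            tf = count / length
            idf = math.log(n_docs / doc_freq[token])
            scores[token] += tf * idf / n_docs
    return dict(scores)


def tfidf_features(macros, top_k):
    """Return the top_k tokens ranked by mean TF-IDF."""
    ranked = sorted(tfidf_scores(macros).items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:top_k]]
```

A token that occurs in every document receives a score of zero under this formula, which reflects the statement that importance decreases with the frequency of the word in the corpus.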

The specific scheme for performing feature screening on the preset training sample documents based on at least one of a bag-of-words model and a TF-IDF model to obtain the training features may be any one of the following:

First: performing feature screening on the preset training sample documents based on a bag-of-words model to obtain bag-of-words features, and taking the bag-of-words features as the training features.

Second: performing feature screening on the preset training sample documents based on a TF-IDF model to obtain TF-IDF features, and taking the TF-IDF features as the training features.

Third: performing feature screening on the preset training sample documents based on a TF-IDF model to obtain TF-IDF features, and filtering the TF-IDF features based on a bag-of-words model to obtain the training features.

Fourth: performing feature screening on the preset training sample documents based on a bag-of-words model to obtain bag-of-words features, performing feature screening on the preset training sample documents based on a TF-IDF model to obtain TF-IDF features, and taking the union or the intersection of the bag-of-words features and the TF-IDF features to obtain the training features.

Fifth: performing feature screening on the preset training sample documents based on a bag-of-words model to obtain bag-of-words features, and filtering the bag-of-words features based on a TF-IDF model to obtain the training features.
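As an illustrative, non-limiting sketch, the combination schemes above reduce to set operations and score-based filtering once each model has produced its candidate features. The function names, the `mode` argument, and the `threshold` parameter are hypothetical; the text does not state how the filtering criterion is chosen.

```python
def combine_features(bow_features, tfidf_features, mode="union"):
    """Union or intersection of bag-of-words and TF-IDF feature sets."""
    bow, tfidf = set(bow_features), set(tfidf_features)
    if mode == "union":
        return sorted(bow | tfidf)
    if mode == "intersection":
        return sorted(bow & tfidf)
    raise ValueError(f"unknown mode: {mode}")


def filter_by_score(candidates, scores, threshold):
    """Keep the candidates from one model whose score under the other model
    reaches the threshold; candidates without a score are dropped."""
    return [f for f in candidates if scores.get(f, float("-inf")) >= threshold]
```

The union keeps a feature if either model selects it, which favours recall; the intersection keeps only features that both models select, which yields a smaller and stricter feature set.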

S103: inputting the feature information into a preset malicious document detection model to obtain a detection result of the document to be detected.

The malicious document detection model in step S103 may be a model obtained by performing model training based on the training sample document, and certainly, in some other embodiments, may also be a model obtained by performing model training based on other training sample documents. The specific generation process of the detection model may refer to the model training process in the following embodiment two.

In an embodiment, the detection of the malicious document may be performed only according to the detection model, that is, the detection result of the model is directly used as the final detection result:

when the detection result in step S103 indicates that the type of the document to be detected is a malicious document, the document to be detected may be determined as a malicious document, and when the detection result indicates that the type of the document to be detected is a normal document, the document to be detected may be determined as a normal document.

In another embodiment, the detection of the malicious document may be performed by combining two means, namely a detection model and a key code detection, specifically including:

when the detection result indicates that the type of the document to be detected is a malicious document, directly determining that the document to be detected is the malicious document; and when the detection result indicates that the type of the document to be detected is a normal document, comparing the macro code with a preset key code, and determining whether the type of the document to be detected is changed from the normal document to a malicious document according to the comparison result.

In the embodiment, a first round of detection is performed based on the model, then a second round of detection is further performed from the key code detection dimension, and the detection result of the detection model is corrected, so that the finally obtained detection result is more accurate.

It should be noted that the preset key code in the present embodiment may include at least one of a first type of code corresponding to the self-starting behavior, a second type of code corresponding to the writing behavior, and a third type of code corresponding to the executing payload behavior. When the macro code of the document to be detected contains code of a type matching the preset key code, the type of the document to be detected can be changed from a normal document to a malicious document. Fig. 2 shows a first type of code corresponding to the self-starting behavior, fig. 3 shows a second type of code corresponding to the writing behavior, and fig. 4 shows a third type of code corresponding to the executing payload behavior.

In the embodiment, the behavior of the macro code of the document to be detected is analyzed through a key code detection technology, and the existence of the type code matched with the preset key code in the macro code of the document to be detected indicates that the macro code has the behavior corresponding to the preset key code.

In order to ensure the accuracy of the correction result, the preset key code may include at least two of a first type of code corresponding to the self-starting behavior, a second type of code corresponding to the writing behavior, and a third type of code corresponding to the executing payload behavior.

And when at least two types of codes matched with the preset key codes exist in the macro codes of the document to be detected, changing the type of the document to be detected from a normal document into a malicious document.
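As an illustrative, non-limiting sketch, the two-round decision described above can be expressed as follows. The contents of figs. 2 to 4 are not reproduced in this text, so the regular expressions in `KEY_CODE_PATTERNS` are hypothetical examples of VBA keywords for each behavior, and the label strings "malicious" and "normal" are likewise assumptions.

```python
import re

# Hypothetical key code patterns, one per behavior category.
KEY_CODE_PATTERNS = {
    "self_start": re.compile(
        r"\b(AutoOpen|Auto_Open|Document_Open|Workbook_Open)\b", re.IGNORECASE),
    "write": re.compile(
        r"\b(SaveToFile|Open\s+.+\s+For\s+(Output|Binary|Append))\b", re.IGNORECASE),
    "execute_payload": re.compile(
        r"\b(Shell|WScript\.Shell|ShellExecute)\b", re.IGNORECASE),
}


def matched_categories(macro_code):
    """Return the set of behavior categories whose pattern occurs in the macro."""
    return {name for name, pattern in KEY_CODE_PATTERNS.items()
            if pattern.search(macro_code)}


def final_verdict(model_label, macro_code, min_categories=2):
    """First round: trust a 'malicious' model result directly.
    Second round: re-label a 'normal' result as malicious when at least
    min_categories key code categories are matched."""
    if model_label == "malicious":
        return "malicious"
    if len(matched_categories(macro_code)) >= min_categories:
        return "malicious"
    return "normal"
```

Setting `min_categories=1` corresponds to the variant in which a single matching type of code is sufficient, while the default of 2 corresponds to the stricter variant that requires at least two types.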

The document detection method provided by the embodiment of the application is essentially a malicious document detection method based on machine learning. By extracting and analyzing the macro code in the document to be detected and detecting it based on a preset malicious document detection model, the scheme can identify unknown malicious documents without placing them in a sandbox for execution, so the detection speed is higher and the cost is lower.

Embodiment two:

An embodiment of the present application provides a malicious document detection model training method. Referring to fig. 5, the method includes:

S501: acquiring a training sample document provided with a malicious class label and a training sample document provided with a normal class label.

For example, 28225 malicious documents and 10509 normal documents may be obtained as training sample documents, and of course, the number of training sample documents may be set according to actual requirements.

S502: acquiring the macro code of each training sample document.

S503: extracting feature information corresponding to preset features from each macro code, wherein the preset features comprise at least one of statistical features and training features.

The statistical features are features in the macro code used for representing statistical data, and the training features are features determined according to the training sample documents. It should be particularly noted that, when the preset features include training features, after step S502 and before step S503, the following steps may be included:

and performing feature screening on the obtained preset training sample document based on at least one of the bag-of-words model and the TF-IDF model to obtain the training features.

The type and specific setting manner of the preset features in this embodiment are similar to those in the first embodiment, and are not described herein again.

S504: performing model training according to the feature information of each training sample document to obtain a malicious document detection model.

In step S504, a random forest machine learning algorithm may be selected to perform model training on the feature information. Of course, other machine learning algorithms may be used for model training.
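As an illustrative, non-limiting sketch, step S504 with a random forest could look as follows using scikit-learn's `RandomForestClassifier`. The feature vectors are toy values invented for this sketch (a line count, a character entropy, and a count of one token), the label encoding (1 for malicious, 0 for normal) is an assumption, and the hyperparameters are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy feature information, one row per training sample document.
# Columns (hypothetical): vba_cnt_loc, vba_entropy_chars, count of "shell".
X_train = [
    [120, 4.6, 3], [200, 4.8, 5], [150, 4.7, 2], [180, 4.9, 4],  # malicious
    [20, 3.9, 0], [35, 4.0, 0], [15, 3.8, 0], [40, 4.1, 0],      # normal
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = malicious label, 0 = normal label

# Train the malicious document detection model.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Detect two unseen documents from their feature information.
predictions = model.predict([[170, 4.8, 4], [25, 3.9, 0]])
```

In practice the rows of `X_train` would be produced by the same feature extraction that is later applied to the document to be detected, so that training and detection use identical feature definitions and ordering.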

Embodiment three:

An embodiment of the present application provides a document detecting apparatus. Referring to fig. 6, the apparatus includes:

The obtaining module 601 is configured to obtain a macro code of a document to be detected.

The extracting module 602 is configured to extract feature information corresponding to a preset feature from a macro code of a document to be detected; the preset features include at least one of statistical features and training features. The statistical characteristics are characteristics used for representing statistical data in the macro codes, the training characteristics are characteristics determined according to preset training sample documents, and the preset training sample documents comprise training sample documents with malicious class labels and training sample documents with normal class labels.

The detection module 603 is configured to input the feature information of the document to be detected into a preset malicious document detection model to obtain a detection result of the document to be detected.

In an exemplary embodiment, the apparatus further includes a determination module configured to determine training features according to a preset training sample document.

In an exemplary embodiment, the determination module is configured to perform feature screening on the preset training sample documents based on at least one of a bag-of-words model and a TF-IDF model to obtain the training features. Specifically, the determination may be performed in, but is not limited to, any one of the following ways:

First: performing feature screening on the preset training sample documents based on a bag-of-words model to obtain bag-of-words features, and determining the bag-of-words features as the training features.

Second: performing feature screening on the preset training sample documents based on a TF-IDF model to obtain TF-IDF features, and determining the TF-IDF features as the training features.

In an exemplary embodiment, the apparatus may further include a correction module, where the correction module is configured to compare the macro code with a preset key code when the detection result indicates that the type of the document to be detected is a normal document, and determine whether to change the type of the document to be detected from the normal document to a malicious document according to the comparison result.

It should be noted that the preset key code in the present embodiment may include at least one of a first type of code corresponding to the self-starting behavior, a second type of code corresponding to the writing behavior, and a third type of code corresponding to the executing payload behavior. When the detection result indicates that the type of the document to be detected is a normal document, if the macro code of the document to be detected contains code of a type matching the preset key code, the correction module can change the type of the document to be detected from the normal document to a malicious document.

In an exemplary embodiment, in order to ensure the accuracy of the correction result, the preset key code may include at least two of a first type of code corresponding to the self-starting behavior, a second type of code corresponding to the writing behavior, and a third type of code corresponding to the executing payload behavior.

And the correction module is used for changing the type of the document to be detected from the normal document to the malicious document when the detection result indicates that the type of the document to be detected is the normal document and at least two types of codes matched with the preset key codes exist in the macro codes of the document to be detected.

Embodiment four:

Based on the same inventive concept, an embodiment of the present application provides a terminal. Referring to fig. 7, the terminal includes a processor 701 and a memory 702, where the memory 702 stores a computer program, and the processor 701 executes the computer program to implement the steps of the method in the first embodiment and/or the second embodiment, which are not described herein again.

It should be noted that the terminal in the present embodiment may be a PC (Personal Computer), a mobile phone, a tablet computer, a notebook computer, a virtual host, and the like. It may also be a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers), etc.

It will be appreciated that the configuration shown in figure 7 is merely illustrative and that the apparatus may also include more or fewer components than shown in figure 7 or have a different configuration than shown in figure 7.

The processor 701 may be an integrated circuit chip having signal processing capabilities. The processor 701 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application.

The memory 702 may include, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.

The present embodiment further provides a computer-readable storage medium, such as a floppy disk, an optical disk, a hard disk, a flash memory, a USB flash drive, an SD (Secure Digital Memory Card) card, an MMC (Multimedia Card) card, etc., where one or more programs for implementing the above steps are stored in the computer-readable storage medium, and the one or more programs may be executed by one or more processors to implement the steps of the method in the first embodiment and/or the second embodiment, which are not described herein again.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation.

The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A method of document detection, comprising:

acquiring a macro code of a document to be detected;

2. The document detection method of claim 1, wherein when the preset features include the training features, the method further comprises the step of determining the training features from the preset training sample documents:

3. The method of claim 2, wherein the feature screening of the preset training sample document based on at least one of a bag-of-words model and a TF-IDF model to obtain the training features comprises:

or, alternatively,

and carrying out feature screening on the preset training sample document based on a bag-of-words model to obtain bag-of-words features, and filtering the bag-of-words features based on a TF-IDF model to obtain the training features.

4. The document detection method of claim 1, wherein the method further comprises:

5. The document detection method of claim 4, wherein the predetermined key code comprises at least two of a first type of code corresponding to a self-start behavior, a second type of code corresponding to a write behavior, and a third type of code corresponding to an execute payload behavior.

6. The document detection method according to claim 5, wherein the determining whether to change the type of the document to be detected from a normal document to a malicious document according to the comparison result includes:

7. A malicious document detection model training method is characterized by comprising the following steps:

acquiring macro codes of the training sample documents;

8. A document detecting device, comprising:

9. A terminal, characterized in that it comprises a processor and a memory, in which a computer program is stored, which computer program is executed by the processor to implement the method according to any one of claims 1-7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by at least one processor, implements the method according to any one of claims 1-7.