CN115730313A

CN115730313A - Malicious document detection method and device, storage medium and equipment

Info

Publication number: CN115730313A
Application number: CN202211550903.8A
Authority: CN
Inventors: 徐晓; 薛智慧; 余小军
Original assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Current assignee: Beijing Topsec Technology Co Ltd; Beijing Topsec Network Security Technology Co Ltd; Beijing Topsec Software Co Ltd
Priority date: 2022-12-05
Filing date: 2022-12-05
Publication date: 2023-03-03

Abstract

The embodiments of the present application provide a malicious document detection method and device, a storage medium, and equipment. Because the function call sequence is extracted and examined statically, malicious document detection is achieved while system performance overhead is effectively reduced.

Description

Malicious document detection method and device, storage medium and equipment

Technical Field

The present application relates to the field of network information security technologies, and in particular, to a malicious document detection method, apparatus, storage medium, and device.

Background

Office documents are widely used in the daily office work of enterprises and public institutions and make up an important share of the documents on the Internet. Macros are an important functional extension of Office documents and are frequently used. Attackers often abuse this feature to embed malicious macro code in Office documents in order to carry out network attacks.

Malicious Office document detection is crucial to network information security. At present, detection schemes for malicious Office documents in the related art mainly extract the features of malicious documents dynamically, by simulating code execution, and then cluster the malicious documents. However, this approach requires simulating a real environment, which is resource consuming.

Disclosure of Invention

An object of the embodiments of the present application is to provide a method, an apparatus, a storage medium, and a device for detecting a malicious document, which aim to solve the problem of large resource consumption in a detection manner for a malicious Office document in the related art.

In a first aspect, a malicious document detection method provided in an embodiment of the present application includes:

extracting code information in a document to be detected, and constructing a function calling sequence by using the code information;

after the function calling sequence is subjected to label marking based on a preset label library, converting the function calling sequence into target characteristics according to the marked labels; the preset label library records keywords for matching each label in the behavior purpose classification and keywords for matching each label in the behavior means classification;

inputting the target characteristics into a trained detection model to judge whether the document to be detected is a malicious document; the detection model is obtained based on benign document samples and malicious document samples through training, and the characteristics of the benign document samples and the characteristics of the malicious document samples are obtained based on the preset tag library.

In the implementation process, after a function call sequence is constructed based on the code information in the document to be detected, the functions are marked with the behavior purpose classification labels and the behavior means classification labels in the preset label library, so that the unordered function call sequence is converted into a machine learning feature of fixed dimension, and this feature is processed by the trained detection model to judge whether the document to be detected is a malicious document. Because the call sequence is extracted and examined statically, malicious document detection is achieved while system performance overhead is effectively reduced.

Further, in some embodiments, the extracting code information in the document to be detected includes:

and analyzing the file for storing the code information in the document to be detected according to the analysis method corresponding to the type of the document to be detected.

In the implementation process, a specific way for extracting code information in the document is provided.

Further, in some embodiments, before parsing the file for storing the code information in the document to be detected according to the parsing method corresponding to the type of the document to be detected, the method includes:

if the document to be detected is of Office 2007 or a later version, an Override child node of a ContentType node is searched for under the root directory of the compressed package of the document to be detected, and the position of the file storing the code information in the document to be detected is then determined based on the value of the Override child node.

In the implementation process, considering that a malicious Office2007+ document can evade parsing by modifying the name of the file that stores the code information, the location of the file storing the code information in an Office2007+ document to be detected is determined by searching for the Override child node, so that malicious Office2007+ documents are parsed effectively.

Further, in some embodiments, said constructing a sequence of function calls using said code information comprises:

scanning the code information to obtain a function unit in the code;

and constructing a function calling sequence according to the calling relation corresponding to the function unit.

In the implementation process, a specific way for constructing the function call sequence is provided, namely, the code information is scanned first, the function unit in the code is obtained, and then the function call sequence is constructed according to the call relation corresponding to the function unit.

Further, in some embodiments, the labels in the behavioral purpose classification include a network connection classification label, a write permission operation classification label, an execution permission operation classification label, a system environment variable classification label, an operating system library call classification label, and a shell operation classification label;

the labels in the behavioral means category include an obfuscated category label, an auto-executed category label, and a window-hidden category label.

In the implementation process, a fine-grained division of the behavior purpose classification and the behavior means classification is provided, and through these nine classes, features that can be used to accurately judge whether a document is malicious can be effectively extracted.

Further, in some embodiments, the converting the sequence of function calls into target features according to the tagged tag comprises:

generating a calling sequence matrix according to the labels in the behavior purpose classification contained in the labels marked by the function calling sequence, and converting the calling sequence matrix into a first characteristic;

generating a second characteristic according to the number of keywords corresponding to each label in the behavior means classification hit by the function calling sequence;

constructing a target feature based on the first feature and the second feature.

In the implementation process, when the function calling sequence is converted into the machine learning characteristics, the behavior purpose classification labels and the behavior means classification labels are processed separately, so that the over-fitting of a machine learning model is prevented, and the system performance overhead is reduced.

Further, in some embodiments, the generating a call sequence matrix according to the labels in the behavior purpose classification included in the labels marked by the function call sequence, and then converting the call sequence matrix into the first feature includes:

replacing the original function name in the function calling sequence by using the label in the behavior purpose classification contained in the label marked by the function calling sequence, and generating a calling sequence matrix according to the replaced function calling sequence;

and converting the calling sequence matrix into a first characteristic according to a row-first or column-first mode.

In the implementation described above, a specific way to turn an unordered call sequence into a dimensionally determined machine learning feature is provided.

In a second aspect, an embodiment of the present application provides an apparatus for detecting a malicious document, including:

the construction module is used for extracting code information in the document to be detected and constructing a function calling sequence by utilizing the code information;

the conversion module is used for converting the function calling sequence into target characteristics according to the marked label after the function calling sequence is subjected to label marking based on a preset label library; the preset label library records keywords for matching each label in the behavior purpose classification and keywords for matching each label in the behavior means classification;

the judging module is used for inputting the target characteristics into a trained detection model so as to judge whether the document to be detected is a malicious document; the detection model is obtained based on benign document samples and malicious document samples through training, and the characteristics of the benign document samples and the characteristics of the malicious document samples are obtained based on the preset tag library.

In a third aspect, an embodiment of the present application provides an electronic device, including: memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method according to any of the first aspect when executing the computer program.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having instructions stored thereon, which, when executed on a computer, cause the computer to perform the method according to any one of the first aspect.

In a fifth aspect, embodiments of the present application provide a computer program product, which when run on a computer, causes the computer to perform the method according to any one of the first aspect.

Additional features and advantages of the disclosure will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the above-described technology disclosed herein.

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

To more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.

Fig. 1 is a flowchart of a malicious document detection method according to an embodiment of the present application;

fig. 2 is a schematic diagram of a workflow of a machine learning malicious Office detection scheme based on tag function call according to an embodiment of the present application;

fig. 3 is a schematic diagram of an original call sequence diagram according to an embodiment of the present application;

fig. 4 is a schematic diagram of a function call sequence obtained by replacing an original function name in an original call sequence diagram with a mapped tag and removing tags in behavior means classification according to the embodiment of the present application;

fig. 5 is a schematic diagram of a call sequence matrix according to an embodiment of the present application;

fig. 6 is a block diagram of a malicious document detection apparatus according to an embodiment of the present application;

fig. 7 is a block diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance.

As described in the background art, the detection method for malicious Office documents in the related art has a problem of large resource consumption. Based on this, the embodiment of the present application provides a malicious document detection scheme to solve the above problem.

Embodiments of the present application are described below:

as shown in fig. 1, fig. 1 is a flowchart of a malicious document detection method provided by an embodiment of the present application, which may be applied to a terminal or a server, where the terminal may be various electronic devices including, but not limited to, a smart phone, a tablet computer, a laptop portable computer, a desktop computer, and the like; the server may be a single server or a distributed server cluster consisting of a plurality of servers. The terminal or the server provides a document Processing environment, which includes a software portion and a hardware portion, wherein the software portion mainly includes an operating system, such as Windows and Linux, and the hardware portion mainly includes computing resources and storage resources, such as a Central Processing Unit (CPU), a memory, a hard disk, and the like. It should be noted that, the terminal/server may also be implemented as multiple software or software modules, or may also be implemented as a single software or software module, which is not limited in this application.

The method comprises the following steps:

step 101, extracting code information in a document to be detected, and constructing a function calling sequence by using the code information;

The document to be detected in this step may be an Office document that needs to be subjected to malicious document detection, such as a Word document, a PDF document, and the like. The function call sequence can be regarded as the function call relations in the code, including the function name of the calling function, the function name of the called function, and a representation of the call order; the representation may take the form of a sequence diagram, a tuple, and the like.

In this embodiment, the function call sequence is constructed by using the code information in the document to be detected, where the code information is the place in which malicious code may be embedded. Since the document to be detected is an Office document, the code information may refer to VBA (Visual Basic for Applications) code and P code (Performance-cache Code, the compiled form of VBA code). In practical applications, Office documents are generally divided into two types: the Office97-2003 version and the Office2007+ version. Office97-2003 documents use the Compound File Binary Format (CFBF) and are OLE (Object Linking and Embedding) files, while Office2007+ documents conform to the OOXML (Office Open XML) standard and are OOXML files. The two types of files differ in where object properties are stored in the document, in how they are parsed, and so on. Therefore, in some embodiments, the extracting code information in the document to be detected in this step may include: parsing the file that stores the code information in the document to be detected according to the parsing method corresponding to the type of the document to be detected. Specifically, the type of the document to be detected can be determined by reading the magic number of the file. Generally speaking, if the first 8 bytes of the file are "D0 CF 11 E0 A1 B1 1A E1", the document to be detected is an Office97-2003 document, and if the first two bytes of the file are "PK", the document to be detected is an Office2007+ document. In an Office97-2003 document, the VBA code and the P code are generally stored in a Module Stream of the document, and in an Office2007+ document, the VBA code and the P code are generally stored in a vbaProject.bin file. Therefore, by parsing the Module Stream or the vbaProject.bin file that stores the VBA code and the P code according to the document type, the code information of the document to be detected can be extracted.
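The type check described above can be sketched as follows. This is a minimal illustration, assuming the whole file (or at least its first 8 bytes) is already in memory; the function name detect_office_type and the returned strings are hypothetical and do not come from the application.

```python
# Signatures taken from the description: 8 bytes for OLE/CFBF, 2 bytes for ZIP-based OOXML.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
ZIP_MAGIC = b"PK"


def detect_office_type(data: bytes) -> str:
    """Classify a document by its leading bytes (hypothetical helper)."""
    if data.startswith(OLE_MAGIC):
        return "office97-2003"
    if data.startswith(ZIP_MAGIC):
        return "office2007+"
    return "unknown"
```

A "PK" prefix only shows that the file is a ZIP archive, so a practical implementation would likely also confirm that the archive has the expected OOXML layout.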

In addition, experiments show that malicious Office2007+ documents can evade parsing by renaming the vbaProject.bin file. Therefore, in some embodiments, if the document to be detected is of Office 2007 or a later version, an Override child node of a ContentType node is searched for under the root directory of the compressed package of the document to be detected, and the position of the file storing the code information in the document to be detected is then determined based on the value of the Override child node. That is, when locating the file that stores the VBA code and the P code in an Office2007+ document, the Override child node can be looked up in the [Content_Types].xml file under the zip root directory. This child node is the tag "Override PartName", which defines the content type corresponding to a specific part file, and the value of the tag is the location of the corresponding VBA code and P code. In this way, malicious Office2007+ documents can be parsed effectively.
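One possible way to perform this lookup is sketched below with the Python standard library. The description only states that the Override node's value gives the location; selecting the node by the content type application/vnd.ms-office.vbaProject is an assumption of this sketch, and find_vba_part is a hypothetical name.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of [Content_Types].xml in the Open Packaging Conventions.
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"
# Assumption of this sketch: the VBA project part is identified by this content type.
VBA_CONTENT_TYPE = "application/vnd.ms-office.vbaProject"


def find_vba_part(doc_bytes: bytes):
    """Return the archive path of the VBA project part, or None if absent."""
    with zipfile.ZipFile(io.BytesIO(doc_bytes)) as zf:
        root = ET.fromstring(zf.read("[Content_Types].xml"))
    for override in root.iter(CT_NS + "Override"):
        if override.get("ContentType") == VBA_CONTENT_TYPE:
            # PartName is an absolute part name such as "/word/vbaProject.bin".
            return override.get("PartName", "").lstrip("/")
    return None
```

Because the path is read from the Override entry rather than assumed, a renamed code file is still found.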

Further, the constructing of the function call sequence by using the code information mentioned in this step may include: scanning the code information to obtain the function units in the code; and constructing the function call sequence according to the call relations corresponding to the function units. Specifically, a function in the code generally starts with Sub and ends with End Sub, so the code is divided into function units according to Sub and End Sub, and each unit is marked with its original function name. In VBA code, a function call is generally written as a direct reference to the function name, while in P code a function call is generally written as ArgsCall followed by the function name. Based on these two syntaxes, the original call sequence diagram of the functions can be obtained.
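The splitting and call extraction described above can be approximated with regular expressions, as in the following sketch. It is not a full VBA parser: it handles only Sub ... End Sub units and treats any occurrence of another unit's name inside a body as a call, which is a simplifying assumption. The names split_function_units and build_call_sequence are hypothetical.

```python
import re

# A unit starts at a line beginning with "Sub <name>" and ends at "End Sub".
SUB_RE = re.compile(
    r"^[ \t]*(?:(?:Public|Private)[ \t]+)?Sub[ \t]+(\w+)[^\n]*\n(.*?)^[ \t]*End Sub",
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)


def split_function_units(code: str) -> dict:
    """Map each original function name to the text of its body."""
    return {m.group(1): m.group(2) for m in SUB_RE.finditer(code)}


def build_call_sequence(code: str) -> list:
    """Return (caller, callee) pairs in order of appearance."""
    units = split_function_units(code)
    calls = []
    for caller, body in units.items():
        # A direct reference to another unit's name is treated as a call.
        for name in re.findall(r"\b\w+\b", body):
            if name in units and name != caller:
                calls.append((caller, name))
    return calls
```

The same name scan would also pick up a disassembled P code line in which the function name follows the call opcode, since the name still appears as a separate word.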

Step 102, after the function call sequence is labeled based on a preset label library, converting the function call sequence into target features according to the marked labels; the preset label library records keywords for matching each label in the behavior purpose classification and keywords for matching each label in the behavior means classification;

Function names in the code are often randomly obfuscated, which makes them arbitrary and meaningless, and the call sequences of malicious documents are usually changeable and unpredictable. In view of this, the embodiment marks the functions with specific labels, so that effective labels can be extracted even from documents to be detected that use obfuscation techniques, and uses the labels to replace the complicated and changeable original function call sequence. The call sequence is thereby normalized, and the unordered call sequence is converted into machine learning features.

Specifically, the preset tag library mentioned in this step is a pre-established tag library, in which keywords for matching each tag in the category of behavior purpose and keywords corresponding to each tag in the category of behavior means are recorded, that is, when a function call sequence of a document to be detected hits a keyword corresponding to a certain tag, the function call sequence is marked by using the tag. The keywords are special function keywords in the function, and are classified according to behavior purposes and behavior means, and as the name suggests, the keywords corresponding to the classification of the behavior purposes are keywords related to attack behavior purposes which can be implemented by an attacker, such as network connection, execution authority operation and the like, and the keywords corresponding to the behavior means are keywords related to behavior means which are adopted by the attacker for achieving the attack behavior purposes, such as confusion, window hiding and the like. Therefore, the method is beneficial to extracting the characteristics which can be used for accurately judging whether the document is malicious or not.

In some embodiments, the labels in the behavior purpose classification may include a network connection classification label, a write permission operation classification label, an execution permission operation classification label, a system environment variable classification label, an operating system library call classification label, and a shell operation classification label; the labels in the behavior means classification include an obfuscation classification label, an automatic execution classification label, and a window hiding classification label. That is, the labels in the preset label library are divided into nine categories, of which the network connection category, the write permission operation category, the execution permission operation category, the system environment variable category, the operating system library call category, and the shell operation category belong to the behavior purpose classification, and the obfuscation category, the automatic execution category, and the window hiding category belong to the behavior means classification. Each category corresponds to one label. For example, "Connections" is a keyword related to network connection, and if a function includes the keyword "Connections", the function is marked with the network connection classification label webconnection_tag; as another example, "Replace" is a keyword associated with obfuscation, and if a function includes the keyword "Replace", the function is marked with the obfuscation classification label object_tag. With these nine classification labels, all function units in the code are marked. It should be noted that one function may carry multiple labels, and one label may be shared by multiple functions. Of course, in other embodiments, the labels in the preset label library may be set differently according to the requirements of the actual scene.
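The keyword matching described above might look like the following sketch. Only the labels webconnection_tag and object_tag and the keywords "Connections", "Replace", and "StrReverse" appear in the description; the other label names and keywords, the case-insensitive comparison, and the name tag_function are assumptions made for illustration.

```python
# A tiny excerpt of a preset label library: label -> keywords that trigger it.
TAG_LIBRARY = {
    # behavior purpose classification
    "webconnection_tag": ["Connections"],      # from the description
    "shell_tag": ["Shell"],                    # hypothetical label and keyword
    # behavior means classification
    "object_tag": ["Replace", "StrReverse"],   # obfuscation keywords from the description
    "autoexec_tag": ["AutoOpen"],              # hypothetical label and keyword
}


def tag_function(body: str) -> set:
    """Return every label whose keywords occur in the function body."""
    lowered = body.lower()  # assumption: match case-insensitively
    return {
        tag
        for tag, keywords in TAG_LIBRARY.items()
        if any(keyword.lower() in lowered for keyword in keywords)
    }
```

Returning a set reflects the note that one function may carry several labels at once.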

Further, in some embodiments, the converting the function call sequence into target features according to the marked labels mentioned in this step may include: generating a call sequence matrix according to the labels in the behavior purpose classification contained in the labels marked on the function call sequence, and converting the call sequence matrix into a first feature; generating a second feature according to the number of keywords corresponding to each label in the behavior means classification hit by the function call sequence; and constructing the target feature based on the first feature and the second feature. Specifically, in order to reduce the complexity of the call sequence matrix and the number of redundant features, when a function carries a label in the behavior means classification, that label is ignored in the process of generating the call sequence matrix, and when a function carries only labels in the behavior means classification, the search proceeds directly to the next called function that carries a behavior purpose classification label, so that the behavior means classification labels are removed from the call sequence. Since the behavior purpose classification includes six types of labels, a 6 x 6 two-dimensional matrix can be constructed in which each index along each dimension corresponds to one behavior purpose classification label. The two-dimensional matrix is then filled in according to the function labels in the function call sequence and converted into a fixed-dimension feature, which gives the first feature.
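The skipping of functions that carry only behavior means labels can be read as a small graph traversal. The sketch below is one interpretation, not the application's stated algorithm: calls maps each function to the functions it calls, purpose_labels maps each function to its set of behavior purpose labels (empty when it has none), and resolve_callees is a hypothetical name.

```python
def resolve_callees(func, calls, purpose_labels, seen=None):
    """Return the nearest called functions that carry a behavior purpose label.

    Functions without any purpose label are skipped, and the search continues
    through the functions they call in turn.
    """
    seen = set() if seen is None else seen
    resolved = []
    for callee in calls.get(func, []):
        if callee in seen:  # guard against recursive call cycles
            continue
        seen.add(callee)
        if purpose_labels.get(callee):
            resolved.append(callee)
        else:
            resolved.extend(resolve_callees(callee, calls, purpose_labels, seen))
    return resolved
```

For instance, if AutoOpen calls a decoding helper that has no purpose label and that helper calls zFcKWSPrk, the resolved sequence links AutoOpen directly to zFcKWSPrk.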

Further, the first feature may be obtained as follows: replacing the original function names in the function call sequence with the labels in the behavior purpose classification contained in the labels marked on the function call sequence, and generating a call sequence matrix according to the replaced function call sequence; and converting the call sequence matrix into the first feature in a row-first or column-first manner. That is, the original function names in the function call sequence are replaced with behavior purpose classification labels to generate the call sequence matrix. For example, let the network connection classification label, the write permission operation classification label, the execution permission operation classification label, the system environment variable classification label, the operating system library call classification label, and the shell operation classification label be denoted A, B, C, D, E, and F in sequence. Suppose that, in the original call sequence, the calling function AutoOpen() contains a keyword corresponding to the shell operation classification label, and the called function zFcKWSPrk() contains a keyword corresponding to the network connection classification label, a keyword corresponding to the write permission operation classification label, and a keyword corresponding to the shell operation classification label. Then, in the function call sequence, the original function name of AutoOpen() can be replaced with F_func and that of zFcKWSPrk() with A_func, B_func, F_func, so that the replaced function call sequence becomes F_func → A_func, B_func, F_func. A call sequence matrix can then be generated according to the replaced function call sequence; in this matrix, the values at row 1, column 6, at row 2, column 6, and at row 6, column 6 are 1, and the values at the remaining positions are 0. Finally, the call sequence matrix can be converted into the first feature with dimension 36 in a row-first or column-first manner. Of course, in other embodiments, the first feature may also be obtained in other ways, for example, by converting the calling function into a column vector according to its behavior purpose classification labels, converting the called function into a row vector according to its behavior purpose classification labels, and obtaining the first feature from the product of the column vector and the row vector.
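The example above can be reproduced with a short sketch. The matrix orientation (row = label of the called function, column = label of the calling function) is inferred from the positions given in the example and is an assumption, as are the names call_sequence_matrix and flatten_row_first.

```python
# Behavior purpose labels in the order used by the example: A..F.
PURPOSE_LABELS = ["A", "B", "C", "D", "E", "F"]


def call_sequence_matrix(edges):
    """Build the 6 x 6 matrix from (caller_labels, callee_labels) pairs."""
    index = {label: i for i, label in enumerate(PURPOSE_LABELS)}
    matrix = [[0] * len(PURPOSE_LABELS) for _ in PURPOSE_LABELS]
    for caller_labels, callee_labels in edges:
        for caller in caller_labels:
            for callee in callee_labels:
                # Assumed orientation: row = callee label, column = caller label.
                matrix[index[callee]][index[caller]] = 1
    return matrix


def flatten_row_first(matrix):
    """Convert the matrix into a 36-dimensional feature, row by row."""
    return [value for row in matrix for value in row]


# F_func -> A_func, B_func, F_func from the example in the text.
first_feature = flatten_row_first(call_sequence_matrix([(["F"], ["A", "B", "F"])]))
```

Under this orientation the three non-zero entries sit at positions 6, 12, and 36 of the flattened feature (counting from 1).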

This embodiment records the behavior means classification labels in the function call sequence as separate features, which reduces the feature dimension generated by the call sequence, helps prevent over-fitting of the subsequent machine learning model, and reduces system performance overhead. Specifically, the number of occurrences of each label in the behavior means classification is counted. If a function contains the keyword "StrReverse" corresponding to the obfuscation classification label, the count of the obfuscation classification label is incremented by 1; thereafter, whenever any keyword corresponding to the obfuscation classification label is hit in this function or in other functions in the document, the count of the obfuscation classification label is again incremented by 1. If no function contains any keyword corresponding to the window hiding classification label, the count of the window hiding classification label is 0. Finally, according to the number of keywords hit by the function call sequence for each label in the behavior means classification, a second feature with dimension 3 can be generated.
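The counting described above can be sketched as follows. The obfuscation keywords "StrReverse" and "Replace" come from the description; the keywords listed for automatic execution and window hiding, the case-insensitive matching, and the name second_feature are assumptions for illustration.

```python
# Keywords per behavior means label, in a fixed order.
MEANS_KEYWORDS = {
    "obfuscation": ["StrReverse", "Replace"],  # from the description
    "auto_execution": ["AutoOpen"],            # hypothetical keyword
    "window_hiding": ["vbHide"],               # hypothetical keyword
}
MEANS_ORDER = ("obfuscation", "auto_execution", "window_hiding")


def second_feature(function_bodies):
    """Count keyword hits per behavior means label across all function bodies."""
    counts = []
    for label in MEANS_ORDER:
        total = 0
        for body in function_bodies:
            lowered = body.lower()  # assumption: match case-insensitively
            total += sum(lowered.count(k.lower()) for k in MEANS_KEYWORDS[label])
        counts.append(total)
    return counts
```

Here every keyword occurrence adds 1 to the count of its label, and a label with no hits stays at 0, so the result always has dimension 3.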

After the first feature and the second feature are obtained, the first feature and the second feature may be combined into a target feature in a Concat or Add manner, so as to obtain a machine learning feature that can characterize a code condition of a document to be detected.
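As a minimal sketch of the Concat option only, the two features can be joined into one vector of dimension 36 + 3 = 39. The function name build_target_feature and the length checks are assumptions added for illustration.

```python
def build_target_feature(first_feature, second_feature):
    """Concatenate the 36-dim first feature and the 3-dim second feature."""
    if len(first_feature) != 36 or len(second_feature) != 3:
        raise ValueError("expected a 36-dim first feature and a 3-dim second feature")
    return list(first_feature) + list(second_feature)
```

Keeping a fixed order (first feature, then second feature) ensures that training samples and documents to be detected share the same feature layout.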

Step 103, inputting the target features into a trained detection model to judge whether the document to be detected is a malicious document; the detection model is obtained based on benign document samples and malicious document samples through training, and the characteristics of the benign document samples and the characteristics of the malicious document samples are obtained based on the preset tag library.

The detection model mentioned in this step is a machine learning model whose training samples comprise benign document samples and malicious document samples. The features of the benign document samples and the malicious document samples are extracted as in step 101 and step 102. For each training sample, the fixed-dimension feature converted from the call sequence matrix built from the behavior purpose classification labels, together with the count feature formed from the behavior means classification labels, is input into a machine learning classification algorithm for training, where the machine learning classification algorithm includes, but is not limited to, a decision tree, a support vector machine, a gradient boosting decision tree, and the like. The parameters are then tuned and the model retrained until performance is optimal, and the detection model is output. The detection model thus learns to classify the document to be detected according to its target features, that is, to judge whether the document to be detected is a benign document or a malicious document. For the specific training process, reference may be made to the description of the corresponding machine learning algorithm in the related art, which is not repeated here.

Optionally, the output value of the detection model is a floating point number between 0 and 1, after the target feature is input into the detection model, if the output value of the detection model is greater than or equal to a threshold value, the document to be detected is determined as a malicious document, otherwise, the document to be detected is determined as a benign document. The threshold value here may be set according to the degree of balance of the training samples, and for example, when the number of benign document samples and malicious document samples is equal, the threshold value may be set to 0.5. Of course, in other embodiments, the threshold may also be set differently according to the requirements of the actual scene.
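The threshold decision can be expressed in a few lines. In this sketch the default of 0.5 corresponds to the balanced-sample case mentioned above; the function name classify and the returned strings are hypothetical.

```python
def classify(score: float, threshold: float = 0.5) -> str:
    """Map a model output in [0, 1] to a verdict using the given threshold."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    # Greater than or equal to the threshold means malicious, as in the description.
    return "malicious" if score >= threshold else "benign"
```

With unbalanced training samples, a different threshold can be passed in without changing the decision logic.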

According to the embodiments of the present application, after the function call sequence is constructed based on the code information in the document to be detected, the functions are marked with the behavior purpose classification labels and the behavior means classification labels in the preset label library, so that the unordered function call sequence is converted into a machine learning feature of fixed dimension, and this feature is then processed by the trained detection model to judge whether the document to be detected is a malicious document. Because the call sequence is extracted and examined statically, malicious document detection is achieved while system performance overhead is effectively reduced.

To illustrate the solution of the present application in more detail, a specific embodiment is described below:

malicious Office document detection is crucial to network information security. Existing malicious Office document detection methods are mainly divided into dynamic detection methods and static detection methods. A dynamic detection method has to simulate a real environment, so its performance overhead is high; among static detection methods, a rule-base method usually depends on a prior knowledge base and cannot detect unknown samples. Based on this, the present embodiment provides a machine learning malicious Office document detection scheme based on labeled function calls, whose workflow is shown in fig. 2 and includes:

S201, extracting the function calls of a training sample; this step specifically includes S2011 to S2015 as follows:

S2011, judging whether the input Office document is an OLE file or an OOXML file; specifically, after a file is input, the magic number of the file is read first: if the first 8 bytes of the file are "D0 CF 11 E0 A1 B1 1A E1", the file is an Office 97-2003 document, and if the first two bytes are "PK", the file is an Office 2007+ document;
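The magic-number check in S2011 could look roughly like the sketch below. The function name `detect_office_container` and the returned strings are assumptions for illustration. Note that "PK" only indicates a ZIP container, so a fuller check would also confirm that the archive actually holds Office parts.

```python
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE compound file header
ZIP_MAGIC = b"PK"                              # ZIP container, used by OOXML


def detect_office_container(header: bytes) -> str:
    """Classify a file by its leading bytes (hypothetical helper)."""
    if header.startswith(OLE_MAGIC):
        return "OLE"      # Office 97-2003 document
    if header.startswith(ZIP_MAGIC):
        return "OOXML"    # Office 2007+ document
    return "unknown"
```

In use, the first 8 bytes would be read from the file, for example with `open(path, "rb").read(8)`, and passed to the function.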

S2012, extracting the VBA code and the P-code of the document; specifically, according to the document type, the Module Stream or the vbaProject.bin file that stores the VBA code and the P-code is parsed;

S2013, obtaining a unified representation of the code; specifically, the code is processed as follows: if both the P-code and the VBA code exist, the VBA code is taken for subsequent processing; if only the VBA code exists, the VBA code is taken; if only the P-code exists, the P-code is disassembled, the resulting disassembled code is denoted DisAsm_P-code, and the DisAsm_P-code is taken for subsequent processing;

S2014, extracting the original function units in the document; specifically, the code obtained in S2013 is scanned to obtain the function units in the code. It should be noted that, in the code, a function generally starts with Sub and ends with End Sub, so the code is divided into function units at each Sub and End Sub pair, and each unit is marked with its original function name;
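As an illustration of S2014, the following sketch divides VBA source text into function units at Sub and End Sub. The helper name, the regular expressions, and the handling of the optional Public/Private modifier are assumptions for this example; real macros also use Function and Property blocks, which are not handled here.

```python
import re

# A unit opens at "Sub <name>" (optionally Public/Private) and closes at "End Sub".
SUB_START = re.compile(r"^\s*(?:(?:Public|Private)\s+)?Sub\s+(\w+)", re.IGNORECASE)
SUB_END = re.compile(r"^\s*End\s+Sub\b", re.IGNORECASE)


def split_function_units(code: str) -> dict[str, list[str]]:
    """Return {original function name: body lines} in source order."""
    units: dict[str, list[str]] = {}
    current = None  # name of the unit being collected, if any
    for line in code.splitlines():
        if current is None:
            match = SUB_START.match(line)
            if match:
                current = match.group(1)
                units[current] = []
        elif SUB_END.match(line):
            current = None
        else:
            units[current].append(line)
    return units
```

Lines outside any Sub block are ignored, which keeps module-level declarations out of the function units.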

S2015, obtaining the original calling sequence diagram according to the calling relations among the original function units; specifically, in VBA code, a function is called by directly referring to its name, and in the DisAsm_P-code, a function is called with the syntax "ArgCall function name"; the original calling sequence diagram of the functions is obtained according to these two syntaxes;
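A minimal sketch of S2015 is given below, assuming the function units are already available as a mapping from function name to body lines. The helper name and the token-based matching are assumptions for illustration. Because both call syntaxes contain the callee name as a separate token, one scan covers the direct VBA reference and the "ArgCall function name" form; self-references are skipped here as a simplification.

```python
import re


def build_call_graph(units: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return {caller: [callees]} for calls between known function units."""
    graph: dict[str, list[str]] = {}
    for caller, body in units.items():
        callees: list[str] = []
        for line in body:
            for token in re.findall(r"\w+", line):
                # A token equal to another unit's name is treated as a call.
                if token in units and token != caller and token not in callees:
                    callees.append(token)
        graph[caller] = callees
    return graph
```

The result is an adjacency list, which is one simple way to hold the original calling sequence diagram before the function names are replaced by labels.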

S202, marking the functions with labels, converting the function calling sequence into fixed-dimension features according to the behavior purpose classification labels, and separately marking the behavior means classification labels as features; this step specifically includes S2021 to S2024 as follows:

S2021, analyzing the special function keywords in the functions, classifying them, and then establishing a label library; specifically, the keywords "Mid", "Left", "Right", "StrReverse", "Xor", "ChrB", "ChrW", "Chr", "Replace", "Hex", etc. are related to obfuscation, so they are classified into the obfuscation classification and labeled obfuscation_tag; the keywords "webrole", "net", "Socket", "Connections", "WorkbookConnection", etc. are related to network connection, so they are classified into the network connection classification and labeled webconnection_tag. The label library contains nine classifications in total; besides the two above, they are auto_tag for automatic execution, writeAction_tag for write permission operations, runAction_tag for execute permission operations, sysEnv_tag for system environment variables, osdllCall_tag for operating system library calls, hideWindow_tag for window hiding, and shell_tag for shell operations. Among the nine classifications, the network connection, write permission operation, execute permission operation, system environment variable, operating system library call and shell operation classifications belong to the behavior purpose classification; the obfuscation, automatic execution and window hiding classifications belong to the behavior means classification;

S2022, marking the obtained function units with function labels according to the self-built label library; specifically, all function units in the code are marked with the nine classification labels; for example, if a function contains the keyword "Connections", the function is marked as webconnection_tag. It should be noted that one function may carry multiple labels and, at the same time, one label may be shared by multiple functions;
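The label library and the marking step of S2021 and S2022 can be sketched as a keyword table plus a lookup. The table below is only an excerpt built from keywords quoted in this description; the case-insensitive substring matching and the helper name `tag_function` are assumptions, since the text only says that a function "contains" a keyword.

```python
# Excerpt of a label library: label -> keywords (not the complete library).
TAG_LIBRARY = {
    "obfuscation_tag": ["Mid", "Left", "Right", "StrReverse", "Xor",
                        "ChrB", "ChrW", "Chr", "Replace", "Hex"],
    "webconnection_tag": ["webrole", "net", "Socket", "Connections",
                          "WorkbookConnection"],
    "auto_tag": ["AutoOpen"],
    "shell_tag": ["shell", "exe"],
    "writeAction_tag": ["DownloadFile"],
}


def tag_function(body: str, library: dict[str, list[str]] = TAG_LIBRARY) -> set[str]:
    """Return every label whose keyword list hits the function text."""
    text = body.lower()
    return {
        tag
        for tag, keywords in library.items()
        if any(keyword.lower() in text for keyword in keywords)
    }
```

Returning a set reflects the note that one function may carry several labels, and applying the same table to every function lets one label be shared by many functions. Substring matching is deliberately loose (a short keyword such as "net" will hit many identifiers), so a stricter implementation might match whole tokens instead.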

S2023, obtaining the calling sequence of the labeled functions and converting it into fixed-dimension features; specifically, the original function names in the original calling sequence diagram obtained in step S201 are replaced with the labels assigned to the respective function units in step S2022, and a calling sequence matrix is then generated from the label-substituted function calling sequence. In this process, to reduce the complexity of the calling sequence matrix and the number of redundant features, the behavior means classification labels are removed when the calling sequence matrix is generated; the matrix is then flattened in row-first or column-first order into a fixed number of 36 features;

An example: scanning the code in a certain Office document yields the function units AutoOpen() and zFcKWSPrk(), and the original calling sequence diagram shown in fig. 3. During labeling, the function AutoOpen() contains the keyword "AutoOpen" of the automatic execution classification auto_tag, the keywords "Right", "Left", "Mid" of the obfuscation classification obfuscation_tag, and the keywords "shell", "exe" of the shell operation classification shell_tag; the function zFcKWSPrk() contains the keywords "Right", "Left", "Chr" of obfuscation_tag, the keyword "http" of webconnect_tag, the keyword "DownloadFile" of writeAction_tag, and the keyword "shell" of the shell operation classification shell_tag. For brevity, writeAction_tag, runAction_tag, shell_tag, webconnect_tag, sysEnv_tag, osdllCall_tag, obfuscation_tag, hideWindow_tag and auto_tag are denoted A, B, C, D, E, F, G, H and I respectively, so that AutoOpen() is labeled C_func, G_func, I_func and zFcKWSPrk() is labeled A_func, C_func, D_func, G_func. After the function names are replaced by these labels and the behavior means labels are removed (figs. 4 and 5), the calling sequence matrix is converted into 36 features: [0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0];
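The 36-value vector of this example can be reproduced with the sketch below, under three assumptions that the text does not state explicitly: AutoOpen() calls zFcKWSPrk() (fig. 3 is not reproduced here), matrix rows index the caller's purpose label and columns the callee's, with entries set to 1 when such a call exists, and the matrix is flattened column-first. All helper names are hypothetical.

```python
# Purpose labels A-F in the order used by the example; means labels are dropped.
PURPOSE_TAGS = ["writeAction_tag", "runAction_tag", "shell_tag",
                "webconnect_tag", "sysEnv_tag", "osdllCall_tag"]


def call_sequence_features(call_graph: dict[str, list[str]],
                           function_tags: dict[str, set[str]]) -> list[int]:
    """Build the 6x6 label call matrix and flatten it column-first."""
    size = len(PURPOSE_TAGS)
    index = {tag: i for i, tag in enumerate(PURPOSE_TAGS)}
    matrix = [[0] * size for _ in range(size)]
    for caller, callees in call_graph.items():
        for callee in callees:
            for caller_tag in function_tags[caller]:
                for callee_tag in function_tags[callee]:
                    # Means labels are absent from index, so they are skipped.
                    if caller_tag in index and callee_tag in index:
                        matrix[index[caller_tag]][index[callee_tag]] = 1
    return [matrix[row][col] for col in range(size) for row in range(size)]


# Assumed call relation and the labels listed in the example.
EXAMPLE_GRAPH = {"AutoOpen": ["zFcKWSPrk"], "zFcKWSPrk": []}
EXAMPLE_TAGS = {
    "AutoOpen": {"shell_tag", "obfuscation_tag", "auto_tag"},
    "zFcKWSPrk": {"writeAction_tag", "shell_tag", "webconnect_tag",
                  "obfuscation_tag"},
}
FEATURES = call_sequence_features(EXAMPLE_GRAPH, EXAMPLE_TAGS)
```

With these assumptions the nonzero entries fall at positions 2, 14 and 20 (counting from 0), which matches the vector given in the example; a row-first flattening with the roles of rows and columns swapped would give the same result.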

S2024, separately marking behavior means such as obfuscation, window hiding and automatic execution as features; specifically, the feature corresponding to each behavior means classification is obtained by counting the number of times the labels of that classification occur. For example, if a function contains the keyword "StrReverse", the obfuscation count obfuscation_tag_num is increased by 1; thereafter, each time one of the obfuscation_tag keywords is hit in this function or in another function of the document, obfuscation_tag_num is increased by 1 again;
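A rough sketch of the counting in S2024 follows. It assumes that every whole-word keyword occurrence counts as one hit, which is one reading of the description; the helper name is hypothetical, and the hideWindow_tag keyword list is left empty because its keywords are not listed in this section.

```python
import re

# Behavior means labels and their keywords (hideWindow keywords not given here).
MEANS_KEYWORDS = {
    "obfuscation_tag": ["Mid", "Left", "Right", "StrReverse", "Xor",
                        "ChrB", "ChrW", "Chr", "Replace", "Hex"],
    "auto_tag": ["AutoOpen"],
    "hideWindow_tag": [],
}


def count_means_tags(bodies: list[str],
                     keywords: dict[str, list[str]] = MEANS_KEYWORDS) -> dict[str, int]:
    """Count keyword hits per behavior means label across all functions."""
    counts = {tag + "_num": 0 for tag in keywords}
    for body in bodies:
        for tag, words in keywords.items():
            for word in words:
                # Whole-word match, so "Chr" is not counted inside "ChrW".
                pattern = r"\b" + re.escape(word) + r"\b"
                counts[tag + "_num"] += len(re.findall(pattern, body, re.IGNORECASE))
    return counts
```

The three resulting counts can then be appended to the 36 calling sequence features to form the input of the classifier.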

S203, training on the sample features with a machine learning classification algorithm to obtain a detection model; specifically, the fixed-number features converted from the calling sequence built from the behavior purpose classification labels and the count features formed from the behavior means classification are input into a machine learning classification algorithm for training, wherein the machine learning classification algorithm includes, but is not limited to, a decision tree, a support vector machine, a gradient boosting decision tree and the like; the parameters are then tuned and training is repeated until the performance is optimal, and the detection model is output. In the training process, the 9000 training sample features of the detection model come from benign Office documents and malicious Office documents of a public data set;

S204, judging whether the Office document to be detected is a malicious document according to the detection model; specifically, the fixed-number features converted from the calling sequence built from the behavior purpose classification labels of the Office document to be detected and the count features formed from the behavior means classification are extracted and input into the detection model; if the output value of the detection model is greater than or equal to 0.5, the Office document to be detected is judged to be a malicious document; otherwise, it is judged to be a benign document.

According to the scheme of the embodiment of the application, the calling sequence is extracted and analyzed statically, which reduces the system overhead. Moreover, because the functions are marked with specific labels, samples whose function names have been rendered meaningless by obfuscation techniques can still be labeled effectively, and the labels replace the complex and changeable original function calling sequence so as to normalize it. The irregular calling sequence thus becomes a machine learning feature of fixed dimension, which enables machine-learning-based malicious Office document detection and allows unknown samples to be detected. In addition, technical means such as obfuscation and window hiding in the function calling sequence of a document are processed as separate features, which reduces the feature dimension generated by the calling sequence, helps prevent overfitting of the machine learning model, and reduces the system performance overhead.

Corresponding to the foregoing method embodiments, the present application further provides embodiments of a malicious document detection apparatus and a terminal applied thereto:

as shown in fig. 6, fig. 6 is a block diagram of a malicious document detection apparatus provided in an embodiment of the present application, where the apparatus includes:

the building module 61 is used for extracting code information in the document to be detected and building a function calling sequence by using the code information;

a conversion module 62, configured to mark the function call sequence with labels based on a preset label library, and convert the function call sequence into a target feature according to the marked labels; the preset label library records keywords for matching each label in the behavior purpose classification and keywords for matching each label in the behavior means classification;

a judging module 63, configured to input the target feature into a trained detection model, so as to judge whether the document to be detected is a malicious document; the detection model is obtained based on training of benign document samples and malicious document samples, and the characteristics of the benign document samples and the characteristics of the malicious document samples are obtained based on the preset tag library.

The implementation process of the functions and actions of each module in the above device is detailed in the implementation process of the corresponding steps in the above method, and is not described herein again.

Fig. 7 is a block diagram of an electronic device according to an embodiment of the present application. The electronic device may include a processor 710, a communication interface 720, a memory 730, and at least one communication bus 740. The communication bus 740 is used to realize direct connection communication among these components. In this embodiment, the communication interface 720 of the electronic device is used for signaling or data communication with other node devices. The processor 710 may be an integrated circuit chip having signal processing capabilities.

The processor 710 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor 710 may be any conventional processor or the like.

The memory 730 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 730 stores computer-readable instructions which, when executed by the processor 710, cause the electronic device to perform the steps of the method embodiment of fig. 1 described above.

Optionally, the electronic device may further include a memory controller, an input output unit.

The memory 730, the memory controller, the processor 710, the peripheral interface, and the input/output unit are electrically connected to each other directly or indirectly, so as to implement data transmission or interaction. For example, these components may be electrically coupled to each other via one or more communication buses 740. The processor 710 is adapted to execute executable modules stored in the memory 730, such as software functional modules or computer programs comprised by the electronic device.

The input/output unit allows a user to create a task and to set an optional start time period or a preset execution time for the created task, thereby realizing interaction between the user and the server. The input/output unit may be, but is not limited to, a mouse, a keyboard, and the like.

It will be appreciated that the configuration shown in fig. 7 is merely illustrative and that the electronic device may include more or fewer components than shown in fig. 7 or have a different configuration than shown in fig. 7. The components shown in fig. 7 may be implemented in hardware, software, or a combination thereof.

The embodiments of the present application further provide a storage medium in which instructions are stored; when the instructions are run on a computer, that is, when the computer program is executed by a processor, the method described in the method embodiments is implemented. To avoid repetition, details are not repeated here.

The present application also provides a computer program product which, when run on a computer, causes the computer to perform the method of the method embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative and, for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist alone, or two or more modules may be integrated to form an independent part.

If the functions are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made to the present application by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A malicious document detection method, comprising:

2. The method according to claim 1, wherein the extracting code information in the document to be detected comprises:

3. The method according to claim 2, wherein before parsing the file storing the code information in the document to be detected according to the parsing method corresponding to the type of the document to be detected, the method comprises:

4. The method of claim 1, wherein constructing a sequence of function calls using the code information comprises:

scanning the code information to obtain a function unit in the code;

5. The method of claim 1, wherein the labels in the behavioral purpose classification include a network connection classification label, a write permission operation classification label, an execute permission operation classification label, a system environment variable classification label, an operating system library call classification label, and a shell operation classification label;

6. The method of claim 5, wherein converting the sequence of function calls into a target feature according to the tagged tag comprises:

7. The method according to claim 6, wherein the step of generating a call sequence matrix according to the labels in the behavior purpose classification included in the labels marked by the function call sequence and converting the call sequence matrix into the first feature comprises:

and converting the calling sequence matrix into the first characteristic according to a row-first or column-first mode.

8. A malicious document detection apparatus, comprising:

the conversion module is used for converting the function calling sequence into target characteristics according to the marked labels after the function calling sequence is subjected to label marking based on a preset label library; the preset label library records keywords for matching each label in the behavior purpose classification and keywords for matching each label in the behavior means classification;

the judging module is used for inputting the target characteristics into a trained detection model so as to judge whether the document to be detected is a malicious document; the detection model is obtained based on training of benign document samples and malicious document samples, and the characteristics of the benign document samples and the characteristics of the malicious document samples are obtained based on the preset tag library.

9. A computer-readable storage medium, characterized in that a computer program is stored thereon which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.

10. An electronic device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 7 when executing the computer program.