CN110363000B

CN110363000B - Method, device, electronic equipment and storage medium for identifying malicious files

Info

Publication number: CN110363000B
Application number: CN201910619922.3A
Authority: CN
Inventors: 朱学文; 罗丹
Original assignee: Shenzhen Tencent Domain Computer Network Co Ltd
Current assignee: Shenzhen Tencent Domain Computer Network Co Ltd
Priority date: 2019-07-10
Filing date: 2019-07-10
Publication date: 2023-11-17
Anticipated expiration: 2039-07-10
Also published as: CN110363000A

Abstract

The disclosure provides a method, a device, an electronic device and a storage medium for identifying malicious files, wherein the method comprises the following steps: grouping data of the file to be identified to form a file element matrix to be identified; generating a first vector of each first type element set based on each first type element set in the file element matrix to be identified, and generating a second vector of each second type element set based on each second type element set in the file element matrix to be identified; summarizing the combined vector of the first vector to obtain a first summary of the file to be identified; summarizing the combined vector of the second vector to obtain a second summary of the file to be identified; summarizing the combined vector of the first summary and the second summary of the file to be identified to obtain an index summary of the file to be identified; and comparing the index abstract of the file to be identified with the index abstract of the malicious file in the preset malicious file library, so as to determine whether the file to be identified is a malicious file. The embodiment of the disclosure can improve the efficiency of identifying the malicious files.

Description

Method, device, electronic equipment and storage medium for identifying malicious files

Technical Field

The disclosure relates to the field of network security, and in particular relates to a method, a device, electronic equipment and a storage medium for identifying malicious files.

Background

Today, with the internet reaching into every corner of daily life, users' personal terminals are frequently attacked. For example, an attacker may implant a malicious file, such as a virus file or a Trojan file, in a user's personal terminal in order to disrupt its normal operation or steal the user's private information. To protect users from malicious files, the files must first be identified so that they can be countered. In the prior art, malicious files are mostly identified by computing a digest of the file to be identified with a high-complexity digest algorithm such as MD5 and comparing it with the digests of known malicious files. However, because a high-complexity digest algorithm such as MD5 often requires a large amount of computing power, it places a heavy burden on the user's personal terminal, which results in low efficiency in identifying malicious files.

Disclosure of Invention

An object of the present disclosure is to provide a method, an apparatus, an electronic device, and a storage medium for identifying a malicious file, which can improve efficiency of identifying a malicious file.

According to an aspect of an embodiment of the present disclosure, a method for identifying malicious files is disclosed, including:

grouping data of the file to be identified according to a preset data grouping rule, wherein each group of data is taken as one element in the element matrix of the file to be identified to form the element matrix of the file to be identified;

generating a first vector of the first type element set based on each first type element set in the file element matrix to be identified, and generating a second vector of the second type element set based on each second type element set in the file element matrix to be identified, wherein the first type element set and the second type element set are determined by dividing elements in the file element matrix to be identified according to a preset element dividing rule;

summarizing the combined vector of the first vectors of the first class element sets to obtain a first summary of the file to be identified;

summarizing the combined vector of the second vectors of each second class element set to obtain a second summary of the file to be identified;

summarizing a combined vector of a first summary and a second summary of a file to be identified to obtain an index summary of the file to be identified;

comparing the index abstract of the file to be identified with the index abstract of the malicious file in a preset malicious file library, thereby determining whether the file to be identified is a malicious file.

According to an aspect of an embodiment of the present disclosure, an apparatus for identifying malicious files is disclosed, including:

the grouping module is configured to group the data of the file to be identified according to a preset data grouping rule, and each group of data is taken as one element in the element matrix of the file to be identified to form the element matrix of the file to be identified;

the element set vector generation module is configured to generate a first vector of each first type element set based on each first type element set in the element matrix of the file to be identified, and generate a second vector of each second type element set based on each second type element set in the element matrix of the file to be identified, wherein the first type element set and the second type element set are determined by dividing elements in the element matrix of the file to be identified according to a preset element division rule;

the first abstract generation module is configured to abstract the combined vector of the first vectors of each first type element set to obtain a first abstract of the file to be identified;

the second abstract generation module is configured to abstract the combined vector of the second vectors of each second class element set to obtain a second abstract of the file to be identified;

the index digest generation module is configured to summarize the combined vector of the first abstract and the second abstract of the file to be identified to obtain the index digest of the file to be identified;

the comparison module is configured to compare the index abstract of the file to be identified with the index abstract of the malicious file in the preset malicious file library so as to determine whether the file to be identified is a malicious file.

According to an aspect of an embodiment of the present disclosure, an electronic device for identifying malicious files is disclosed, including: a memory storing computer readable instructions; a processor reads the computer readable instructions stored by the memory to perform the method as described above.

According to an aspect of the disclosed embodiments, a computer program medium is disclosed, on which computer readable instructions are stored which, when executed by a processor of a computer, cause the computer to perform the method as described above.

The embodiments of the disclosure calculate index digests for both files to be identified and malicious files in the same way: the data of a file (a file to be identified or a malicious file) is grouped, a file element matrix is built from the grouped data, and the information in the file element matrix is then compressed, combined, and digested multiple times. When identifying malicious files, whether the file to be identified is a malicious file is determined by comparing the respective index digests. Because a digest algorithm with low complexity and low computing power requirements can be used for each digest step, and because the information in the file element matrix is compressed, combined, and digested multiple times, the index digest obtained by the embodiments of the disclosure maintains security while reducing the computing power the terminal needs to calculate it, which improves the efficiency of identifying malicious files.

Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The above and other objects, features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.

FIG. 1 illustrates an architecture diagram of a method of identifying malicious files according to one embodiment of the present disclosure.

FIGS. 2A-2G illustrate display interface diagrams of a terminal in a process of identifying malicious files, showing the general process of identifying malicious files, according to one embodiment of the present disclosure.

FIG. 3 illustrates a flow chart of a method of identifying malicious files according to one embodiment of the present disclosure.

FIG. 4 illustrates a flow diagram in a scenario in which a single particular malicious file is identified, according to one embodiment of the present disclosure.

FIG. 5 illustrates a flowchart for creating a malicious file repository according to one embodiment of the present disclosure.

FIG. 6 illustrates a schematic composition of information stored in a malicious file repository according to one embodiment of the present disclosure.

FIG. 7 illustrates a schematic diagram of information stored in a malicious file repository in an actual environment according to one embodiment of the present disclosure.

FIG. 8 illustrates a schematic diagram of an index summary list according to one embodiment of the present disclosure.

FIG. 9 illustrates an overall flow diagram for identifying malicious files according to one embodiment of the present disclosure.

FIG. 10 illustrates a communication interaction diagram of terminals according to one embodiment of the present disclosure.

FIG. 11 illustrates a schematic diagram of attacking traditional digest computation according to one embodiment of the present disclosure.

FIG. 12 illustrates a schematic diagram of attacking the index digest computation of an embodiment of the present disclosure, according to one embodiment of the present disclosure.

FIG. 13 illustrates a block diagram of a terminal identifying malicious files according to one embodiment of the present disclosure.

FIG. 14 illustrates a hardware structure diagram of a terminal that identifies malicious files according to one embodiment of the present disclosure.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these example embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted.

Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more example embodiments. In the following description, numerous specific details are provided to give a thorough understanding of example embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, components, steps, etc. In other instances, well-known structures, methods, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.

Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.

The architecture to which embodiments of the present disclosure apply is described below first with reference to fig. 1.

Fig. 1 shows an architecture to which embodiments of the present disclosure apply. The malicious file library 100 stores index digest values of malicious files. The identification server 200 is connected to each client 300 and to the malicious file library 100. When identifying a malicious file, the client 300 calculates the index digest value of the file to be identified according to the method of the embodiments of the present disclosure and uploads it to the identification server 200. The identification server 200 compares the index digest value against the malicious file library 100 to determine whether the file to be identified is a malicious file, and then returns the result to the client 300.

In one embodiment, the identification server 200 is a server for cloud scanning. In cloud scanning, the cloud maintains a malicious file library composed of malicious files such as viruses and Trojans, and whether a client contains such malicious files is judged by checking the client's files against the malicious file library in the cloud.

In one embodiment, the malicious file library also stores the MD5 message digest value of each malicious file. When the identification system determines that a file to be identified is a malicious file, the MD5 value of the malicious file can be looked up by its index digest value, so that the original malicious file can be obtained.

In an embodiment, remark information of the malicious file is stored in the malicious file library, for example, file description of the malicious file, malicious behavior characteristics of the malicious file, and the like. When the identification system determines that a file to be identified is a malicious file, remark information of the malicious file can be obtained according to the index abstract value of the malicious file.
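The library lookup described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the in-memory dictionary, the field names `md5` and `remark`, the function name `lookup`, and the placeholder digest values are hypothetical and are not taken from the disclosure.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the malicious file library 100.
# Keys are index digest values; each record holds the MD5 value and the
# remark information mentioned in the text. All values are placeholders.
MALICIOUS_LIBRARY = {
    "1c291ca3": {
        "md5": "<md5 of the malicious sample>",
        "remark": "<file description and malicious behavior characteristics>",
    },
}


def lookup(index_digest: str, library: dict = MALICIOUS_LIBRARY) -> Optional[dict]:
    """Return the stored record if the index digest is in the library, else None."""
    return library.get(index_digest)


hit = lookup("1c291ca3")   # record with MD5 value and remark information
miss = lookup("00000000")  # None: not in the library
```

In this sketch a hit means the file is treated as malicious, and the record gives access to the MD5 value and the remark information in one step.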

The following describes display interface diagrams of terminals in embodiments of the present disclosure with reference to fig. 2A-2G to demonstrate a general process of identifying malicious files.

When the identification system installed at the client 300 is started, the interface of the client 300 displays the start-up reminding information as shown in fig. 2A. The identification system is an executable file installed on the client 300, capable of establishing a connection with the identification server 200, and serving as a communication interface between the identification server 200 and the client 300.

After the identification system installed at the client 300 is started, connection to the identification server 200 is started, the interface of the client 300 displays connection reminding information as shown in fig. 2B, and the user can confirm connection to the identification server 200 by clicking a "ok" button on the interface.

After the client 300 connects to the identification server 200, the identification server 200 displays, as shown in fig. 2C, the series of operations it performs while establishing the connection with the client 300: receiving a connection request and authenticating the client.

After the connection with the identification server 200 is successful, the interface of the client 300 displays connection success reminding information as shown in fig. 2D, reminds the user to specify a scanning position, and displays an option button providing a capability of scanning different positions. The user may designate the corresponding scan position by clicking a button on the interface. Further, if the user clicks a "specify path" button on the interface, the interface of the client 300 displays a path specification interface as shown in fig. 2E, from which the user can specify a path of a file to be scanned.

After confirming the scan, the interface of the client 300 displays a scan reminder as shown in FIG. 2F, and the user can confirm the scan of the identification system by clicking the "ok" button on the interface.

After the scanning is completed, if the file virus.exe is identified as a virus file, the interface of the client 300 displays the identification result information as shown in fig. 2G.

It should be noted that the above terminal interface diagrams are merely exemplary illustrations made for illustrating the general processes of embodiments of the present disclosure, and should not be taken to limit the functionality or scope of the present disclosure.

The process of the embodiments of the present disclosure is described in detail below.

As shown in fig. 3, according to one embodiment of the present disclosure, there is provided a method of identifying a malicious file, including:

step 310, grouping data of the file to be identified according to a predetermined data grouping rule, wherein each group of data is taken as one element in the element matrix of the file to be identified to form the element matrix of the file to be identified;

step 320, generating a first vector of the first type element set based on each first type element set in the to-be-identified file element matrix, and generating a second vector of the second type element set based on each second type element set in the to-be-identified file element matrix, wherein the first type element set and the second type element set are determined by dividing elements in the to-be-identified file element matrix according to a predetermined element dividing rule;

step 330, summarizing the combined vector of the first vectors of each first class element set to obtain a first summary of the file to be identified;

step 340, summarizing the combined vectors of the second vectors of each second class element set to obtain a second summary of the file to be identified;

step 350, summarizing the combined vector of the first summary and the second summary of the file to be identified to obtain an index summary of the file to be identified;

step 360, comparing the index abstract of the file to be identified with the index abstract of the malicious file in the preset malicious file library, so as to determine whether the file to be identified is a malicious file.

A malicious file is a file that exhibits malicious behavior, such as a file that illegally captures network data, a file that illegally acquires protected data, or a file that maliciously injects itself into other processes.

The first class element set and the second class element set refer to element sets obtained by dividing elements in a file element matrix to be identified according to a preset element division rule. For example, in one embodiment, the matrix of file elements to be identified is:

a11 a12 a13 a14
a21 a22 a23 a24
a31 a32 a33 a34
a41 a42 a43 a44

If the predetermined element dividing rule is to divide the elements of each row into a corresponding first class element set and the elements of each column into a corresponding second class element set, 4 first class element sets are obtained: [a11, a12, a13, a14], [a21, a22, a23, a24], [a31, a32, a33, a34], [a41, a42, a43, a44]; and 4 second class element sets are obtained: [a11, a21, a31, a41], [a12, a22, a32, a42], [a13, a23, a33, a43], [a14, a24, a34, a44].

If the predetermined element dividing rule is to divide the elements lying on lines parallel to or coincident with the main diagonal into corresponding first class element sets, and the elements lying on lines parallel to or coincident with the auxiliary diagonal into corresponding second class element sets, 7 first class element sets are obtained: [a41], [a31, a42], [a21, a32, a43], [a11, a22, a33, a44], [a12, a23, a34], [a13, a24], [a14]; and 7 second class element sets are obtained: [a11], [a12, a21], [a13, a22, a31], [a14, a23, a32, a41], [a24, a33, a42], [a34, a43], [a44].
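The two division rules above can be sketched as follows. This is an illustrative sketch only: the function names `rows_and_columns` and `diagonals` are hypothetical, and the matrix holds element labels rather than file data.

```python
def rows_and_columns(matrix):
    """Row rule: each row is a first class set, each column a second class set."""
    first = [list(row) for row in matrix]
    second = [list(col) for col in zip(*matrix)]
    return first, second


def diagonals(matrix):
    """Diagonal rule: lines parallel to the main diagonal form the first class
    sets, lines parallel to the auxiliary diagonal form the second class sets."""
    main, anti = {}, {}
    for i, row in enumerate(matrix):
        for j, element in enumerate(row):
            # column - row is constant along lines parallel to the main diagonal
            main.setdefault(j - i, []).append(element)
            # row + column is constant along lines parallel to the auxiliary diagonal
            anti.setdefault(i + j, []).append(element)
    first = [main[k] for k in sorted(main)]
    second = [anti[k] for k in sorted(anti)]
    return first, second


# 4 x 4 matrix of labels a11 .. a44, as in the example above
matrix = [[f"a{i}{j}" for j in range(1, 5)] for i in range(1, 5)]
row_sets, column_sets = rows_and_columns(matrix)
main_sets, anti_sets = diagonals(matrix)
```

For the 4 x 4 example, `main_sets` starts with `['a41']` and ends with `['a14']`, and `anti_sets` starts with `['a11']` and ends with `['a44']`, matching the 7 + 7 sets listed above.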

It should be noted that the embodiments are merely exemplary illustrations and should not be construed as limiting the function and scope of the disclosure.

A digest is fixed-length, pseudo-random output information computed from an input of arbitrary length. Digest algorithms can be divided into two main categories according to whether they are cryptographic. One category is cryptographic digest algorithms, for example the SHA (Secure Hash Algorithm) family, whose five algorithms are SHA-1, SHA-224, SHA-256, SHA-384 and SHA-512, and the MD5 algorithm (MD5 Message-Digest Algorithm), which outputs a 128-bit digest value. The other category is non-cryptographic digest algorithms, for example CRC32, one of the cyclic redundancy check (CRC) algorithms, which outputs a 32-bit digest value.
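The output lengths mentioned above can be checked with Python's standard `hashlib` and `zlib` modules. The input bytes are an arbitrary example and the variable names are only illustrative.

```python
import hashlib
import zlib

data = b"example input of arbitrary length"

# Cryptographic digests: output length is fixed regardless of input length
md5_bits = hashlib.md5(data).digest_size * 8        # 128
sha1_bits = hashlib.sha1(data).digest_size * 8      # 160
sha256_bits = hashlib.sha256(data).digest_size * 8  # 256

# Non-cryptographic digest: CRC32 yields an unsigned 32-bit integer
crc32_value = zlib.crc32(data)
crc32_hex = f"{crc32_value:08x}"  # 8 hex characters = 32 bits
```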

It should be noted that the identification object of the embodiments of the present disclosure may be a PE (Portable Executable) file. Common EXE, DLL, OCX, SYS and COM files are PE files, which are program files on the Microsoft Windows operating system (some are executed indirectly, such as DLLs).
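As background, a file can be cheaply pre-screened for the PE layout before any digest is computed. The sketch below is not part of the disclosed method; it relies only on the publicly documented PE layout (an `MZ` DOS header whose field at offset 0x3C points to the `PE\0\0` signature), and the function name `looks_like_pe` is hypothetical.

```python
import struct


def looks_like_pe(data: bytes) -> bool:
    """Return True if the bytes carry the MZ header and the PE signature."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return False
    # The 4-byte little-endian field at offset 0x3C holds the PE header offset
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    return data[pe_offset:pe_offset + 4] == b"PE\x00\x00"


# Minimal synthetic header for demonstration only (not a loadable executable)
sample = b"MZ" + b"\x00" * 58 + struct.pack("<I", 0x40) + b"PE\x00\x00"
```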

In the embodiments of the disclosure, after receiving a file to be identified, the identification system groups the data of the file and builds the element matrix of the file to be identified from the resulting groups. It then determines a first vector of each first type element set and a second vector of each second type element set in the element matrix, and performs digest operations on the first vectors and the second vectors to obtain the index digest of the file to be identified. Finally, it compares the index digest of the file to be identified with the index digests of the malicious files in a preset malicious file library, thereby determining whether the file to be identified is a malicious file.
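The flow of steps 310 to 360 can be sketched end to end under several stated assumptions. The choices below are illustrative and not confirmed by the text: one byte per element, a matrix width of 3, a row-by-row fill, a zero byte as placeholder, row/column division, vectors formed by plain concatenation, and CRC32 as the low-complexity digest. The function names are hypothetical.

```python
import zlib


def crc32_bytes(data: bytes) -> bytes:
    """CRC32 of data as 4 big-endian bytes (assumed low-complexity digest)."""
    return zlib.crc32(data).to_bytes(4, "big")


def index_digest(data: bytes, width: int = 3) -> str:
    # Step 310: one byte per element, filled row by row, zero-byte padding
    elements = [data[i:i + 1] for i in range(len(data))]
    while len(elements) % width:
        elements.append(b"\x00")
    matrix = [elements[i:i + width] for i in range(0, len(elements), width)]

    # Step 320: rows give the first vectors, columns give the second vectors
    # (assumption: a vector is the concatenation of the elements of its set)
    first_vectors = [b"".join(row) for row in matrix]
    second_vectors = [b"".join(col) for col in zip(*matrix)]

    # Steps 330 and 340: digest the combined first and second vectors
    first_digest = crc32_bytes(b"".join(first_vectors))
    second_digest = crc32_bytes(b"".join(second_vectors))

    # Step 350: digest the combination of the two digests
    return crc32_bytes(first_digest + second_digest).hex()


def is_malicious(data: bytes, library: set) -> bool:
    # Step 360: compare against the index digests in the library
    return index_digest(data) in library


library = {index_digest(b"evil payload")}  # hypothetical library content
```

With this sketch, `is_malicious(b"evil payload", library)` is true because the same procedure produced the library entry. How vectors are actually generated and which digest algorithm is used are left to the detailed embodiments.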

The process of grouping the data of the file to be identified and creating the element matrix of the file to be identified is described below.

In step 310, the data of the file to be identified is grouped according to a predetermined data grouping rule, and each group of data is divided into one element in the element matrix of the file to be identified, so as to form the element matrix of the file to be identified.

In an embodiment, the data of the file to be identified includes bytes in a binary representation of the file to be identified, the data of the file to be identified is grouped according to a predetermined data grouping rule, each group of data is divided into one element in a matrix of elements of the file to be identified, and the element matrix of the file to be identified is formed, including:

taking out each preset number of bytes from the binary representation of the file to be identified as an element in sequence, and filling the bytes into a blank matrix with preset matrix width in sequence from top to bottom and from left to right.

In this embodiment, the byte size of each element in the matrix is preset, after the file to be identified is obtained, the identification system sequentially takes out the binary data of the obtained file to be identified, the data with the preset byte size is taken as an element, and the binary data is filled into a blank matrix with the preset matrix width from top to bottom and from left to right.

For example, each element in the preset matrix is 1 byte in size, and the preset matrix width is 3. The binary data of the obtained file to be identified is 00111001011101010010101011111001..10110110. The identification system sequentially takes out data of 1 byte in size from the binary data, resulting in element 1: 00111001, element 2: 01110101, element 3: 00101010, element 4: 11111001, ..., element 12: 10110110.

Filling the 12 elements into a blank matrix with the matrix width of 3 according to the sequence from top to bottom and from left to right to obtain a file element matrix to be identified:

By this method, a file element matrix to be identified that contains all data of the file to be identified is obtained, so that subsequent processing of the matrix is equivalent to processing the file itself.
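The byte grouping of this embodiment can be sketched as below. The function name `build_element_matrix` is hypothetical, and the fill order is assumed to be row by row (each row left to right, rows top to bottom). Only the four bytes given explicitly in the example are used, since the remaining bytes are elided in the text.

```python
def build_element_matrix(data: bytes, element_size: int = 1, width: int = 3):
    """Group bytes into elements (shown as bit strings) and fill rows of `width`."""
    elements = [
        "".join(f"{byte:08b}" for byte in data[i:i + element_size])
        for i in range(0, len(data), element_size)
    ]
    return [elements[i:i + width] for i in range(0, len(elements), width)]


# First four bytes of the example: 00111001 01110101 00101010 11111001
matrix = build_element_matrix(bytes([0x39, 0x75, 0x2A, 0xF9]))
```

Here the last row is incomplete because only four elements are supplied; a file with twelve 1-byte elements would fill four complete rows of width 3.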

In an embodiment, the data of the file to be identified includes characters of the file to be identified, the data of the file to be identified is grouped, each group of data is divided into one element in the element matrix of the file to be identified, and the element matrix of the file to be identified is formed, including:

converting each preset number of characters taken in sequence from the file to be identified into a bit string, taking the bit string as an element, and filling the elements into a blank matrix with a preset matrix width in sequence from top to bottom and from left to right.

In this embodiment, the number of characters of each element in the matrix is preset, after the file to be identified is obtained, the identification system sequentially takes out the obtained character data of the file to be identified, converts the data of the preset number of characters into binary bit strings, and fills the binary bit strings into a blank matrix with a preset matrix width as an element according to the sequence from top to bottom and from left to right.

For example, the number of characters per element in the matrix is preset to 1, and the preset matrix width is set to 3. The character data of the file to be identified, coded in ASCII, is obtained as: a2fg..a..h. The identification system sequentially takes out 1 character at a time and converts it to a binary bit string (it takes out 'a' and converts it to the bit string 01100001, then takes out '2' and converts it to the bit string 00110010), obtaining element 1: 01100001, element 2: 00110010, element 3: 01000110, element 4: 0100111.., and so on, up to the last element: 01001000.
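The character-to-bit-string conversion can be sketched as follows, assuming ASCII encoding as in the example. The function name `chars_to_elements` is hypothetical, and only the two characters whose conversion is spelled out in the text ('a' and '2') are used.

```python
def chars_to_elements(text: str, chars_per_element: int = 1):
    """Convert each group of characters to one bit-string element (ASCII assumed)."""
    raw = text.encode("ascii")
    return [
        "".join(f"{byte:08b}" for byte in raw[i:i + chars_per_element])
        for i in range(0, len(raw), chars_per_element)
    ]


elements = chars_to_elements("a2")                           # one character per element
paired = chars_to_elements("a2", chars_per_element=2)        # two characters per element
```

The resulting elements are then filled into the blank matrix in the same way as in the byte-based embodiment.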

In one embodiment, after filling the blank matrices with the preset matrix width in the order from top to bottom and from left to right, the method further comprises:

if all the elements of the file to be identified have been taken out, filling a preset placeholder in each unfilled position in the blank matrix.

In this embodiment, it is not known before the file to be identified is received how many groups its data will be divided into. Therefore, after each group of data is filled as an element into a blank matrix with a preset matrix width, the blank matrix may not be completely filled. In that case, the unfilled positions in the blank matrix are filled with a predetermined placeholder (e.g., bit 0) so that the matrix is completely filled.

For example, the preset matrix width is 3, and after grouping binary data of the file to be identified, the following are obtained: element 1: 00111001, element 2: 01110101, element 3: 00101010, element 4: 11111001, ..., element 11: 10110110.

Filling the 11 elements into a blank matrix with the matrix width of 3 according to the sequence from top to bottom and from left to right gives an unfilled element matrix of the file to be identified:

At this time, in order to fill up the matrix of the file elements to be identified, bit 0 is filled in the unfilled position in the matrix, so as to obtain the filled matrix of the file elements to be identified:

By this method, a complete file element matrix to be identified is obtained without affecting the properties of the matrix.
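The padding step can be sketched as below. The function name `pad_and_fill` is hypothetical, and the placeholder is assumed to be an all-zero element of the same length as the other elements; the text only says "bit 0". The 11 element values are dummy data, since most of the example's elements are elided.

```python
def pad_and_fill(elements, width: int = 3, placeholder: str = "00000000"):
    """Fill elements row by row and pad the last row with the placeholder."""
    padded = list(elements)
    remainder = len(padded) % width
    if remainder:
        padded.extend([placeholder] * (width - remainder))
    return [padded[i:i + width] for i in range(0, len(padded), width)]


# 11 dummy elements: the last row of a width-3 matrix has one unfilled position
elements = [f"{value:08b}" for value in range(1, 12)]
matrix = pad_and_fill(elements)
```

With 11 elements and a width of 3, the result has 4 rows and exactly one placeholder in the final position.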

In an embodiment, after the data of the file to be identified is grouped and each group of data is taken as one element to form the element matrix of the file to be identified, the method further includes:

introducing a first confusion factor into a file element matrix to be identified, wherein the first confusion factor comprises at least one of the following:

splicing preset bits after each element in the element matrix of the file to be identified;

and adding preset element rows at preset row number positions in the element matrix of the file to be identified.

A confusion factor is a piece of data used to enhance the hashing effect of the overall data, and may be constructed by the user according to the required hashing effect. The first confusion factor is the confusion factor introduced into the file element matrix to be identified in the embodiments of the disclosure.

Because the element matrix of the file to be identified is the basis for generating the index digest of the file to be identified, the hash effect of the element matrix of the file to be identified directly affects the security of the generated index digest. In order to improve the security of the finally generated index abstract, data of the file to be identified are grouped to form a file element matrix to be identified, and then a first confusion factor is introduced into the file element matrix to be identified so as to enhance the hash effect of the file element matrix to be identified.

In one embodiment, introducing a first confusion factor in a matrix of file elements to be identified includes: splicing preset bits after each element in the element matrix of the file to be identified. The preset bits are the first confusion factor.

For example, the first confusion factor is the bits 01. The identification system groups binary data of the file to be identified, fills each element into a blank matrix, and then obtains an element matrix of the file to be identified:

In order to enhance the hash effect of the element matrix of the file to be identified, the preset bits 01 are spliced after each element to obtain the element matrix of the file to be identified:

This embodiment has the advantage that introducing the first confusion factor improves the hash effect of the element matrix of the file to be identified, so that the security of subsequent digest calculation is higher.
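Splicing preset bits after each element can be sketched as follows. The function name `append_bits` is hypothetical; the bits `01` come from the example above, and the matrix row uses the three element values given earlier in the text.

```python
def append_bits(matrix, bits: str = "01"):
    """Return a new matrix with the preset bits spliced after every element."""
    return [[element + bits for element in row] for row in matrix]


original = [["00111001", "01110101", "00101010"]]
confused = append_bits(original)
```

Each 8-bit element becomes a 10-bit element ending in `01`, and the original matrix is left unchanged.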

In one embodiment, introducing a first confusion factor in a matrix of file elements to be identified includes: adding preset element rows at preset row number positions in the element matrix of the file to be identified. The preset element row is the first confusion factor.

For example, the first confusion factor is a row of elements (11110000 11110000 11110000) added below the second row of the element matrix of the file to be identified. The identification system groups binary data of the file to be identified, fills each element into a blank matrix, and then obtains an element matrix of the file to be identified:

In order to enhance the hash effect of the file element matrix to be identified, an element row (11110000 11110000 11110000) is added below the second row, so as to obtain the file element matrix to be identified:
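The row-insertion form of the first confusion factor can be sketched as follows. The function name `insert_row` is hypothetical, the inserted row is the one from the example, and the other matrix values are dummy data because the example matrix is not reproduced in this text.

```python
def insert_row(matrix, row, after_row: int = 2):
    """Return a new matrix with `row` added below the given 1-based row number."""
    if matrix and len(row) != len(matrix[0]):
        raise ValueError("confusion row must match the matrix width")
    return matrix[:after_row] + [list(row)] + matrix[after_row:]


confusion_row = ["11110000", "11110000", "11110000"]
original = [
    ["00000001", "00000010", "00000011"],  # dummy row 1
    ["00000100", "00000101", "00000110"],  # dummy row 2
    ["00000111", "00001000", "00001001"],  # dummy row 3
]
confused = insert_row(original, confusion_row, after_row=2)
```

The confusion row becomes the third row of the new matrix, and the former third row moves down by one.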

The specific process of generating the first vector of the first class element set and the second vector of the second class element set after the element matrix of the file to be identified is acquired is described below.

In step 320, a first vector of the first type element set is generated based on each first type element set in the to-be-identified file element matrix, and a second vector of the second type element set is generated based on each second type element set in the to-be-identified file element matrix, where the first type element set and the second type element set are determined by dividing elements in the to-be-identified file element matrix according to a predetermined element division rule.

As described above for the division into first-type and second-type element sets, the element matrix of the file to be identified may be divided by rows and columns to obtain the first-type element sets and the second-type element sets, or it may be divided along its diagonals. For brevity, the subsequent processes of the embodiments of the present disclosure are described for the case in which the matrix is divided by rows and columns. The subsequent processes are the same when the matrix is divided along its diagonals or according to other rules, and are therefore not described again.

In an embodiment, according to the predetermined element division rule, the elements of each row of the element matrix of the file to be identified are determined as a corresponding first-type element set, and the elements of each column are determined as a corresponding second-type element set. In this case, generating a first vector of each first-type element set based on that first-type element set, and generating a second vector of each second-type element set based on that second-type element set, includes:

Generating a row vector of a row based on elements of the row in the element matrix of the file to be identified, and generating a column vector of a column based on elements of the column in the element matrix of the file to be identified.

In this embodiment, the elements of each row in the element matrix of the file to be identified are determined as a corresponding first-type element set, and the elements of each column are determined as a corresponding second-type element set. Generating the first vector from a first-type element set therefore means generating the row vector of a row from the elements of that row; generating the second vector from a second-type element set means generating the column vector of a column from the elements of that column.

In an embodiment, generating a row vector of a row based on an element of the row in the file element matrix to be identified, and generating a column vector of a column based on an element of the column in the file element matrix to be identified, includes:

solving the exclusive OR of the elements of the row in the element matrix of the file to be identified to obtain the row vector of the row;

and solving the exclusive OR of the elements of the column in the element matrix of the file to be identified to obtain the column vector of the column.

In this embodiment, when generating the row vector of each row and the column vector of each column in the element matrix of the file to be identified, the elements of each row are XORed to obtain the corresponding row vector, and the elements of each column are XORed to obtain the corresponding column vector.

For example, the element matrix of the file to be identified is:

When generating the row vector of the first row, the elements of the first row are XORed: the row vector of the first row is obtained as (01100110). The row vectors of the other rows and the column vectors of the columns are generated in the same way, and the details are not repeated here.

The embodiment has the advantages that compression of matrix information is achieved through exclusive OR of elements, and the efficiency of subsequent matrix processing is improved.
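The row and column XOR described above can be sketched as follows. Elements are modelled as 8-bit integers; the function names and the sample matrix are illustrative assumptions, not values from the disclosure.

```python
from functools import reduce
from operator import xor


def row_vectors(matrix):
    """XOR the elements of each row to get one row vector per row."""
    return [reduce(xor, row) for row in matrix]


def column_vectors(matrix):
    """XOR the elements of each column to get one column vector per column."""
    return [reduce(xor, column) for column in zip(*matrix)]


# Hypothetical 2x3 matrix of one-byte elements.
matrix = [
    [0x12, 0x34, 0x56],
    [0x78, 0x9A, 0xBC],
]
rows = row_vectors(matrix)        # [0x70, 0x5E]
cols = column_vectors(matrix)     # [0x6A, 0xAE, 0xEA]
```

Six elements are reduced to two row vectors and three column vectors, which is the compression the embodiment refers to.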

In another embodiment, generating the row vector of a row based on the elements of the row in the element matrix of the file to be identified includes:

solving the exclusive OR of the odd-numbered elements of the row in the element matrix of the file to be identified to obtain a first exclusive-OR vector;

solving the exclusive OR of the even-numbered elements of the row in the element matrix of the file to be identified to obtain a second exclusive-OR vector;

and cascading the first exclusive-OR vector and the second exclusive-OR vector to obtain the row vector of the row.

In this embodiment, when generating the row vector of each row in the element matrix of the file to be identified, the odd-numbered elements of each row are xored to obtain a first xored vector, the even-numbered elements of each row are xored to obtain a second xored vector, and the first xored vector and the second xored vector are cascaded to obtain the row vector of the corresponding row.

For example, the element matrix of the file to be identified is:

When generating the row vector of the first row, the odd-numbered elements of the first row are XORed to obtain the first exclusive-OR vector (00010011) of the first row; the even-numbered elements of the first row are XORed to obtain the second exclusive-OR vector (001110101) of the first row. The two are concatenated to obtain the row vector (00010011001110101) of the first row. The row vectors of the other rows and the column vectors of the columns are generated in the same way, and the details are not repeated here.

The advantage of this embodiment is that XORing the odd-numbered elements and the even-numbered elements of each row and of each column separately compresses the matrix information and improves the efficiency of subsequent matrix processing.
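A minimal sketch of the odd/even variant, assuming 8-bit integer elements and 1-based element positions. The helper name `odd_even_vector` and the sample row are hypothetical.

```python
from functools import reduce
from operator import xor


def odd_even_vector(elements, bits=8):
    """XOR odd-positioned and even-positioned elements separately, then concatenate."""
    first = reduce(xor, elements[0::2], 0)   # 1st, 3rd, 5th, ... elements
    second = reduce(xor, elements[1::2], 0)  # 2nd, 4th, 6th, ... elements
    return format(first, f"0{bits}b") + format(second, f"0{bits}b")


# Hypothetical row of four one-byte elements.
row = [0x12, 0x34, 0x56, 0x78]
vector = odd_even_vector(row)  # 0x12^0x56 = 0x44, 0x34^0x78 = 0x4C
```

The same helper can be applied to a column by passing the elements of that column.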

The specific process of generating the index digest on the basis of the first vector of each first-type element set and the second vector of each second-type element set is described below.

In step 330, the combined vector of the first vectors of each first class element set is summarized, so as to obtain the first summary of the file to be identified.

The combination vector of the first vectors of the first-type element sets refers to a vector obtained by combining together the first vectors of the first-type element sets.

In an embodiment, according to the predetermined element division rule, the elements of each row of the element matrix of the file to be identified are determined as a corresponding first-type element set, and the elements of each column are determined as a corresponding second-type element set. In this case, summarizing the combined vector of the first vectors of the first-type element sets to obtain the first summary of the file to be identified includes:

and summarizing the combined vector of the row vectors of each row to obtain a first summary of the file to be identified.

The combination vector of the row vectors of the respective rows means a vector obtained by combining the row vectors of the respective rows together.

In step 340, the combined vector of the second vectors of each second class element set is summarized, so as to obtain a second summary of the file to be identified.

The combined vector of the second vectors of the respective second-class element sets means a vector obtained by combining together the second vectors of the respective second-class element sets.

In an embodiment, according to the predetermined element division rule, the elements of each row of the element matrix of the file to be identified are determined as a corresponding first-type element set, and the elements of each column are determined as a corresponding second-type element set. In this case, summarizing the combined vector of the second vectors of the second-type element sets to obtain the second summary of the file to be identified includes:

and summarizing the combined vector of the column vectors of each column to obtain a second summary of the file to be identified.

The combination vector of column vectors of each column means a vector obtained by combining together column vectors of each column.

In one embodiment, the combination vector of the row vectors of each row is a vector formed by concatenating the row vectors of each row, and the combination vector of the column vectors of each column is a vector formed by concatenating the column vectors of each column.

In the embodiment, after generating row vectors of each row and column vectors of each column of the element matrix of the file to be identified, cascading the row vectors of each row to obtain a combined vector of the row vectors of each row; and cascading the column vectors of each column to obtain a combined vector of the column vectors of each column.

For example, suppose the row vectors generated for the first to fourth rows of the element matrix of the file to be identified are (01100110), (10111100), (11011000) and (00001011). Cascading the four row vectors gives the combined vector of the row vectors (01100110, 10111100, 11011000, 00001011). Suppose the column vectors of the first to third columns are (10111000), (10001101) and (00110100). Cascading the three column vectors gives the combined vector of the column vectors (10111000, 10001101, 00110100).

The embodiment has the advantage that the row vectors and the column vectors are respectively combined to integrate the information contained in the element matrix of the file to be identified, so that the element matrix can be uniformly processed.
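The cascading step amounts to simple concatenation. The sketch below uses the row and column vectors from the example above, held as bit strings; the helper name `cascade` is an assumption.

```python
def cascade(vectors):
    """Concatenate the vectors, in order, into one combined bit string."""
    return "".join(vectors)


row_vecs = ["01100110", "10111100", "11011000", "00001011"]
col_vecs = ["10111000", "10001101", "00110100"]

row_combined = cascade(row_vecs)
col_combined = cascade(col_vecs)
```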

In one embodiment, the combination vector of the row vectors of each row is a vector formed by concatenating the row vectors of the odd numbered rows and a vector formed by concatenating the row vectors of the even numbered rows, and the combination vector of the column vectors of each column is a vector formed by concatenating the column vectors of the odd numbered columns and a vector formed by concatenating the column vectors of the even numbered columns.

In the embodiment, after generating row vectors of each row and column vectors of each column of the element matrix of the file to be identified, vectors formed by cascading row vectors of each odd row and vectors formed by cascading row vectors of each even row are cascaded, so that a combined vector of the row vectors of each row is obtained; the vector formed by concatenating the column vectors of each odd-numbered column and the vector formed by concatenating the column vectors of each even-numbered column are concatenated, thereby obtaining a combined vector of the column vectors of each column.

For example, suppose the row vectors generated for the first to fourth rows of the element matrix of the file to be identified are (01100110), (10111100), (11011000) and (00001011). The row vectors of the first and third rows are cascaded to obtain (01100110, 11011000), the row vectors of the second and fourth rows are cascaded to obtain (10111100, 00001011), and these two results are cascaded to obtain the combined vector of the row vectors (01100110, 11011000, 10111100, 00001011). Suppose the column vectors of the first to third columns are (10111000), (10001101) and (00110100). The column vectors of the first and third columns are cascaded to obtain (10111000, 00110100); since there is only one even-numbered column, the result for the even-numbered columns is simply the column vector of the second column (10001101). These two results are cascaded to obtain the combined vector of the column vectors (10111000, 00110100, 10001101).

The embodiment has the advantages that the odd row vectors and the even row vectors are combined, the odd column vectors and the even column vectors are combined, and the integration of information contained in the element matrix of the file to be identified is realized, so that the element matrix to be identified can be uniformly processed, and is not easy to attack, and the safety of subsequent abstract calculation is improved.
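The odd-then-even ordering can be sketched with list slicing. The helper name `odd_even_cascade` is hypothetical; the vectors are those of the example above.

```python
def odd_even_cascade(vectors):
    """Order vectors as odd-numbered ones (1st, 3rd, ...) followed by even-numbered ones."""
    return vectors[0::2] + vectors[1::2]


row_vecs = ["01100110", "10111100", "11011000", "00001011"]
col_vecs = ["10111000", "10001101", "00110100"]

row_combined = odd_even_cascade(row_vecs)
col_combined = odd_even_cascade(col_vecs)
```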

In the embodiment of the disclosure, after a combined vector of a first vector is obtained, a summary of the combined vector of the first vector is obtained, and a first summary of the file to be identified is obtained; and obtaining a second summary of the file to be identified by summarizing the combined vector of the second vector after obtaining the combined vector of the second vector.

Specifically, when the first class element set and the second class element set are partitioned according to rows or columns of the file element matrix to be identified: after obtaining a combined vector of the row vectors, summarizing the combined vector of the row vectors to obtain a first summary of the file to be identified; and obtaining a second summary of the file to be identified by summarizing the combined vector of the column vectors after obtaining the combined vector of the column vectors.

For example, the resulting row vector has a combined vector of (01100110, 10111100, 11011000, 00001011), which is hashed: hash (01100110, 10111100, 11011000, 00001011) to obtain a first digest of the file to be identified; the resulting combined vector of column vectors is (10111000, 10001101, 00110100), which is hashed: hash (10111000, 10001101, 00110100) to obtain a second digest of the file to be identified.

In an embodiment, the digest algorithm used in the first digest and the digest algorithm used in the second digest may be the same digest algorithm or different digest algorithms.

The advantage of this embodiment is that the digest algorithm used for the first digest and the digest algorithm used for the second digest can be configured independently, which makes the resulting first and second digests harder to attack and improves the security of the digest calculation.
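As a rough sketch of this step, the combined vectors from the example can be digested with two independently chosen algorithms. SHA-256 and SHA-512 from Python's `hashlib` are used here purely as stand-ins; the disclosure does not prescribe them, and the helper names are assumptions.

```python
import hashlib


def to_bytes(vectors):
    """Pack a list of bit strings into bytes (each vector padded to whole bytes)."""
    return b"".join(int(v, 2).to_bytes((len(v) + 7) // 8, "big") for v in vectors)


def digest(vectors, algorithm):
    """Digest a combined vector with the named hashlib algorithm (a stand-in choice)."""
    return hashlib.new(algorithm, to_bytes(vectors)).hexdigest()


row_combined = ["01100110", "10111100", "11011000", "00001011"]
col_combined = ["10111000", "10001101", "00110100"]

first_digest = digest(row_combined, "sha256")   # algorithm for the first digest
second_digest = digest(col_combined, "sha512")  # may differ from the first
```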

In one embodiment, the digest algorithm used in the first digest or the digest algorithm used in the second digest is a consistency digest algorithm that satisfies consistency.

A consistency digest algorithm (consistent hashing) is a distributed algorithm that distributes key-value pairs evenly across a plurality of hash nodes. The digest process can be regarded as a mapping that maps arbitrary original messages into a space of fixed size; a consistency digest algorithm performs this mapping uniformly, which reduces mapping conflicts. In other words, a digest computed with a consistency digest algorithm meets the requirement of a low conflict rate.

In one embodiment, the digest algorithm used for the first digest or the digest algorithm used for the second digest is the Jump Consistent Hash algorithm, which has essentially zero memory overhead, distributes keys uniformly, runs quickly, and is suitable for use in distributed systems.

The embodiment has the advantages that by using the consistency digest algorithm, the conflict rate of the digest is reduced, and the security of digest calculation is improved.
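For reference, the published Jump Consistent Hash algorithm maps a 64-bit integer key to a bucket index using only a few local variables. The Python transcription below is a sketch of that public algorithm; how the disclosure derives a digest from its output is not described, so nothing here should be read as the patented procedure.

```python
def jump_consistent_hash(key: int, num_buckets: int) -> int:
    """Map a 64-bit key to a bucket in [0, num_buckets) with no lookup table."""
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    key &= 0xFFFFFFFFFFFFFFFF  # treat the key as an unsigned 64-bit value
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step, as in the published algorithm.
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b
```

When the number of buckets grows from n to n + 1, a key either keeps its bucket or moves to the new bucket n, which is the consistency property referred to above.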

In an embodiment, before summarizing the combined vector of the first vectors of each first type of element set and summarizing the combined vector of the second vectors of each second type of element set, the method further comprises: introducing a second confusion factor into the combined vector of the first vectors of the first element sets, and introducing a second confusion factor into the combined vector of the second vectors of the second element sets.

The second confusion factor refers to the confusion factor introduced into the combined vectors in the embodiments of the present disclosure.

Specifically, when the first-type element sets and the second-type element sets are divided according to the rows and columns of the element matrix of the file to be identified, introducing a second confusion factor into the combined vector of the first vectors of the first-type element sets and into the combined vector of the second vectors of the second-type element sets includes:

introducing the second confusion factor into the combined vector of the row vectors of the rows, and introducing the second confusion factor into the combined vector of the column vectors of the columns.

In one embodiment, introducing the second confusion factor into the combined vector of the row vectors of the rows includes: splicing preset bits after each row vector in the combined vector. The preset bits are the second confusion factor.

For example, the second confusion factor is the bit pair 01. The identification system obtains the combined vector (01100110, 10111100, 11011000, 00001011) of the row vectors of the rows of the element matrix of the file to be identified. Before this combined vector is summarized to obtain the first digest, the second confusion factor is introduced into it, giving the combined vector (0110011001, 1011110001, 1101100001, 0000101101) of the row vectors with the second confusion factor introduced.

The advantage of this embodiment is that introducing the second confusion factor improves the hashing effect of the combined vector, which makes the subsequent digest calculation more secure.

In one embodiment, introducing the second confusion factor into the combined vector of the row vectors of the rows includes: adding a preset vector at a preset position in the combined vector. The preset vector is the second confusion factor.

For example, the second confusion factor is the vector (11110000), added after the second row vector in the combined vector. The identification system obtains the combined vector (01100110, 10111100, 11011000, 00001011) of the row vectors of the rows of the element matrix of the file to be identified. Before this combined vector is summarized to obtain the first digest, the second confusion factor is introduced into it, giving the combined vector (01100110, 10111100, 11110000, 11011000, 00001011) of the row vectors with the second confusion factor introduced.

The second confusion factor is introduced into the combined vector of the column vectors of the columns in the same way as into the combined vector of the row vectors of the rows, so the details are not repeated here.
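Both ways of introducing the second confusion factor can be sketched on the combined row vector from the examples. The helper names `append_bits` and `insert_vector` are hypothetical; the input and expected values are those given in the text.

```python
def append_bits(vectors, factor):
    """Splice the preset bits after each vector in the combined vector."""
    return [v + factor for v in vectors]


def insert_vector(vectors, factor, after):
    """Insert a preset vector after the `after`-th vector (1-based) of the combined vector."""
    return vectors[:after] + [factor] + vectors[after:]


row_combined = ["01100110", "10111100", "11011000", "00001011"]
spliced = append_bits(row_combined, "01")
inserted = insert_vector(row_combined, "11110000", after=2)
```

The same helpers apply unchanged to the combined vector of the column vectors.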

The specific process of obtaining the index digest of the file to be identified after obtaining the first digest and the second digest of the file to be identified is described below.

In step 350, the combined vector of the first digest and the second digest of the file to be identified is digested, and an index digest of the file to be identified is obtained.

In an embodiment, the combined vector of the first digest and the second digest refers to a vector obtained by concatenating the first digest and the second digest.

For example, if the first digest of the file to be identified is 88y7dF984f40ks20 and the second digest is aj87324sVnkj2j00, the combined vector of the first digest and the second digest is (88y7dF984f40ks20, aj87324sVnkj2j00).

In an embodiment, the digest algorithm used to compute the index digest may be the same as or different from the digest algorithm used to compute the first digest, and may be the same as or different from the digest algorithm used to compute the second digest.

The advantage of this embodiment is that the digest algorithms used for the first digest, the second digest and the index digest can each be configured independently, which makes the resulting index digest harder to attack and improves the security of the digest calculation.

In one embodiment, the digest algorithm used to compute the index digest is a consistency digest algorithm that satisfies consistency.

The embodiment has the advantages that by using the consistency digest algorithm, the conflict rate of index digests is reduced, and the security of digest calculation is improved.

In one embodiment, before the combined vector of the first digest and the second digest of the file to be identified is summarized, the method further comprises: introducing a third confusion factor into the combined vector of the first digest and the second digest.

The third confusion factor refers to the confusion factor introduced into the combined vector of the first digest and the second digest in the embodiments of the present disclosure.

In one embodiment, introducing the third confusion factor into the combined vector of the first digest and the second digest includes: splicing a preset byte after each of the first digest and the second digest in the combined vector. The preset byte is the third confusion factor.

For example, the third confusion factor is the byte 4a. The identification system obtains the first digest 88y7dF984f40ks20 and the second digest aj87324sVnkj2j00 of the file to be identified, so the combined vector of the first digest and the second digest is (88y7dF984f40ks20, aj87324sVnkj2j00). Before this combined vector is summarized to obtain the index digest, the third confusion factor is introduced into it, giving the combined vector (88y7dF984f40ks204a, aj87324sVnkj2j004a) of the first digest and the second digest with the third confusion factor introduced.

The advantage of this embodiment is that introducing the third confusion factor improves the hashing effect of the combined vector of the first digest and the second digest, which makes the subsequent digest calculation more secure.

In one embodiment, introducing the third confusion factor into the combined vector of the first digest and the second digest includes: adding a preset byte group at a preset position in the combined vector. The preset byte group is the third confusion factor.

For example, the third confusion factor is the byte group ffvbbbggkkuuii77 added before the first digest. The identification system obtains the combined vector (88y7dF984f40ks20, aj87324sVnkj2j00) of the first digest and the second digest. Before this combined vector is summarized to obtain the index digest, the third confusion factor is introduced into it, giving the combined vector (ffvbbbggkkuuii77, 88y7dF984f40ks20, aj87324sVnkj2j00) of the first digest and the second digest with the third confusion factor introduced.
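The two variants of the third confusion factor, followed by the final digest over the combined vector, might look like the sketch below. SHA-256 is an assumed stand-in for the unspecified digest algorithm, and all function names are hypothetical; the digest strings and confusion factors are the ones used in the examples.

```python
import hashlib


def splice_suffix(digests, factor):
    """Variant 1: splice the preset byte after each digest."""
    return [d + factor for d in digests]


def prepend_group(digests, group):
    """Variant 2: add a preset byte group before the first digest."""
    return [group] + digests


def index_digest(parts):
    """Digest the combined vector (SHA-256 is an assumed stand-in)."""
    return hashlib.sha256("".join(parts).encode("ascii")).hexdigest()


combined = ["88y7dF984f40ks20", "aj87324sVnkj2j00"]
suffixed = splice_suffix(combined, "4a")
prefixed = prepend_group(combined, "ffvbbbggkkuuii77")
plain_index = index_digest(combined)
suffixed_index = index_digest(suffixed)
```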

The specific process of determining whether a file to be identified is a malicious file after obtaining the index abstract of the file to be identified is described below.

In step 360, the index digest of the file to be identified is compared with the index digests of the malicious files in the preset malicious file library, so as to determine whether the file to be identified is a malicious file.

In the embodiments of the present disclosure, the index digests of various malicious files are prestored in the malicious file library, and whether the file to be identified is a malicious file is determined by comparing its index digest with the index digests of the malicious files. After the identification system installed at the client calculates the index digest of the file to be identified by the above method, it reports the index digest to the identification server. The identification server compares the index digest of the file to be identified with the index digest of each malicious file in the malicious file library and returns the comparison result to the identification system installed at the client.

In one embodiment, a full-text digest of each malicious file is stored in the malicious file library in correspondence with the index digest of that file. The full-text digest is obtained by applying a predetermined digest algorithm to the full text of the malicious file, and is used to look up the original file of the malicious file.

Comparing the index digest of the file to be identified with the index digests of the malicious files in the preset malicious file library, so as to determine whether the file to be identified is a malicious file, includes: if the index digest of the file to be identified matches the index digest of a malicious file in the preset malicious file library, determining that the file to be identified is a malicious file.

After comparing the index digest of the file to be identified with the index digests of the malicious files in the preset malicious file library, so as to determine whether the file to be identified is a malicious file, the method further includes:

searching for the original file of the malicious file by using the full-text digest stored in correspondence with the matched index digest.

In this embodiment, the malicious file library stores the index digest of each malicious file together with the full-text digest (e.g., MD5) of that file, and the full-text digest is used to look up the original file of the malicious file. After the index digest of the file to be identified is calculated, it is compared with the index digest of each malicious file stored in the malicious file library. If it matches the index digest of a malicious file in the library, the file to be identified is determined to be a malicious file, specifically the malicious file corresponding to the matched index digest. In this case the file to be identified is said to hit the malicious file.

If the original file of the malicious file needs to be looked up, it can be found by using the full-text digest (e.g., MD5) corresponding to the matched index digest.

The advantage of this embodiment is that, by storing the index digest and the full-text digest of each malicious file in correspondence, it can be quickly determined whether the file to be identified is a malicious file, and the original file of the matching malicious file can be quickly found.
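The lookup can be pictured as a mapping from index digest to full-text digest. The sketch below is only an illustration: the dictionary layout, the function name `identify`, and every digest value in it are hypothetical placeholders rather than real library contents.

```python
from typing import Dict, Optional

# Hypothetical malicious file library: index digest -> full-text digest (e.g., MD5).
MALICIOUS_LIBRARY: Dict[str, str] = {
    "index-digest-of-sample-a": "full-text-digest-of-sample-a",
    "index-digest-of-sample-b": "full-text-digest-of-sample-b",
}


def identify(index_digest: str, library: Dict[str, str]) -> Optional[str]:
    """Return the full-text digest of the hit malicious file, or None if there is no hit."""
    return library.get(index_digest)


hit = identify("index-digest-of-sample-a", MALICIOUS_LIBRARY)
miss = identify("index-digest-of-unknown-file", MALICIOUS_LIBRARY)
```

A non-`None` result both marks the file as malicious and provides the key for retrieving the original malicious file.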

The specific process of obtaining the index digest of a malicious file in the malicious file library is described below.

In an embodiment, the index digest of the malicious file in the preset malicious file library is obtained in advance by the following way:

grouping data of malicious files according to a preset data grouping rule, wherein each group of data is divided into one element in a malicious file element matrix to form the malicious file element matrix;

generating a first vector of each first type element set based on each first type element set in the malicious file element matrix, and generating a second vector of each second type element set based on each second type element set in the malicious file element matrix, wherein the first type element set and the second type element set are determined by dividing elements in the malicious file element matrix according to a preset element dividing rule;

Summarizing the combined vector of the first vectors of the first class element sets to obtain a first summary of the malicious file;

summarizing the combined vector of the second vectors of each second class element set to obtain a second summary of the malicious file;

summarizing the combined vector of the first summary and the second summary of the malicious file to obtain an index summary of the malicious file.

In the embodiment of the disclosure, in order to ensure that the index digest of the file to be identified can correctly hit the index digest of the malicious file, the process of calculating the index digest of the malicious file is required to be completely consistent with the process of calculating the index digest of the file to be identified: the process of data grouping of files to be identified is consistent with the process of data grouping of malicious files; the elements in the file element matrix to be identified and the elements in the malicious file element matrix contain data with the same size; the summarization algorithm for summarizing the combined vector of the first vectors of the first class element sets in the element matrix of the file to be identified is the same as the summarization algorithm for summarizing the combined vector of the first vectors of the first class element sets in the element matrix of the malicious file; the digest algorithm for summarizing the combined vector of the first digest and the second digest of the file to be identified is the same as the digest algorithm for summarizing the combined vector of the first digest and the second digest of the malicious file.

Further, the preset width of the element matrix of the file to be identified is the same as the preset width of the malicious file element matrix; and if a first confusion factor, a second confusion factor and a third confusion factor are introduced when the index digest of the malicious file is calculated, exactly the same first, second and third confusion factors are introduced when the index digest of the file to be identified is calculated.

For example, when calculating an index digest of a malicious file: sequentially extracting 1-byte-sized data from binary data of a malicious file as elements; the width of a preset malicious file element matrix is 3; splicing bit 01 after each element in the malicious file element matrix; the summarization algorithm for summarizing the combined vector of the row vectors of each row in the malicious file element matrix is Jump Consistent Hash; the digest algorithm for summarizing the combined vector of the first digest and the second digest of the malicious file is Jump Consistent Hash.

Then, when calculating the index abstract of the file to be identified: sequentially taking out 1 byte size data from binary data of a file to be identified as elements; the width of a preset file element matrix to be identified is 3; splicing bit 01 after each element in the element matrix of the file to be identified; the summarization algorithm for summarizing the combined vector of the row vectors of each row in the element matrix of the file to be identified is Jump Consistent Hash; the digest algorithm for summarizing the combined vector of the first digest and the second digest of the file to be identified is Jump Consistent Hash.

In summary, in practical implementations of the embodiments of the present disclosure, the process of calculating the index digest of the file to be identified must be completely consistent with the process of calculating the index digest of the malicious file; only then can the file to be identified correctly hit the malicious file during identification.
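One way to keep the two calculations consistent is to drive both from a single immutable parameter record. The sketch below is an assumption about how this could be organized; the class `DigestParams` and its field names are hypothetical, with default values taken from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DigestParams:
    """Parameters that must match between library building and identification."""
    element_bytes: int = 1            # size of each matrix element
    matrix_width: int = 3             # elements per row
    first_factor_bits: str = "01"     # bits spliced after each element
    vector_digest: str = "Jump Consistent Hash"
    index_digest: str = "Jump Consistent Hash"


library_params = DigestParams()
client_params = DigestParams()
mismatched = DigestParams(matrix_width=4)
```

Comparing the two records before identification gives a cheap check that an index digest computed on the client can be matched against the library at all.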

The following describes the main process of calculating the index digest of a file in the embodiment of the present disclosure from the technical level, so as to more fully demonstrate the process of obtaining the index digest.

Assume that the byte length of the file is $L$. The file is first grouped: assuming that each element occupies $b$ bytes (e.g., $b = 4$), there are $L/b$ elements in total. The elements are then filled into a matrix $M$ with $n$ elements per row. To enhance the hashing effect, a first confusion factor may be introduced; the unfilled positions are then filled with 0, which yields a matrix $M$ with $m = \lceil L/(n \times b) \rceil$ rows in total, as follows:
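The grouping and zero-filling can be sketched as follows, leaving out the optional first confusion factor. The function name `build_matrix` and the sample data are assumptions; `b` and `n` follow the symbols used above.

```python
import math


def build_matrix(data: bytes, b: int = 4, n: int = 3):
    """Group `data` into b-byte elements and fill them row by row, n per row, padding with 0."""
    elements = [data[i:i + b].ljust(b, b"\x00") for i in range(0, len(data), b)]
    m = math.ceil(len(data) / (n * b))                # number of rows
    elements += [bytes(b)] * (m * n - len(elements))  # zero-fill unfilled positions
    return [elements[r * n:(r + 1) * n] for r in range(m)]


data = bytes(range(26))  # hypothetical 26-byte file
matrix = build_matrix(data)
```

With $L = 26$, $b = 4$ and $n = 3$, the sketch yields $m = 3$ rows; the last data element is padded to 4 bytes and the two remaining positions are all zeros.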

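The elements of each row and of each column of the matrix M are XORed to obtain the row vectors $x_1, \ldots, x_m$ and the column vectors $y_1, \ldots, y_n$, and thereby the combination vector $(x_1, \ldots, x_m)$ of the row vectors and the combination vector $(y_1, \ldots, y_n)$ of the column vectors, wherein, with $a_{ij}$ denoting the element in row i and column j of M:

$$x_i = a_{i1} \oplus a_{i2} \oplus \cdots \oplus a_{in}, \quad i = 1, \ldots, m$$

$$y_j = a_{1j} \oplus a_{2j} \oplus \cdots \oplus a_{mj}, \quad j = 1, \ldots, n$$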

A hash function (preferably one that satisfies consistency) is selected to compute digest information for the combination vector of the row vectors and for the combination vector of the column vectors, yielding the first digest $d_x$ and the second digest $d_y$, as follows:

$$d_x = \mathrm{Hash}(x_1, x_2, \ldots, x_m)$$

$$d_y = \mathrm{Hash}(y_1, y_2, \ldots, y_n)$$
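The grouping, XOR and digest steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, elements are read as big-endian integers, no confusion factor is applied, and `hashlib.blake2b` with an 8-byte output stands in for the hash function, which the disclosure leaves open.

```python
import hashlib
from functools import reduce
from math import ceil
from operator import xor


def build_matrix(data, b=4, n=3):
    """Group data into b-byte elements and fill an m x n matrix row by row."""
    m = ceil(len(data) / (n * b))
    padded = data.ljust(m * n * b, b"\x00")  # unfilled positions become 0
    elems = [int.from_bytes(padded[i:i + b], "big")
             for i in range(0, len(padded), b)]
    return [elems[r * n:(r + 1) * n] for r in range(m)]


def xor_vectors(matrix):
    """Return (x_1..x_m, y_1..y_n): the XOR of every row and every column."""
    rows = [reduce(xor, row) for row in matrix]
    cols = [reduce(xor, col) for col in zip(*matrix)]
    return rows, cols


def digest(vectors, b=4):
    """Concatenate the vectors and hash them (BLAKE2b is an assumed stand-in)."""
    combined = b"".join(v.to_bytes(b, "big") for v in vectors)
    return hashlib.blake2b(combined, digest_size=8).digest()


matrix = build_matrix(bytes(range(1, 14)))  # 13 bytes -> 2 rows of 3 elements
x, y = xor_vectors(matrix)
d_x, d_y = digest(x), digest(y)
```

Because every element contributes to exactly one row vector and one column vector, changing a single element always changes both a row vector and a column vector, and therefore both digests' inputs.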

Before calculating the first digest and the second digest, a second confusion factor may be introduced into the combination vector of the row vectors and into the combination vector of the column vectors.

For the first digest $d_x$ and the second digest $d_y$, a third confusion factor $\sigma$ may be introduced when necessary to obtain the final index digest, as follows:

$$\mathrm{Digest} = \mathrm{Hash}(d_x, d_y, \sigma)$$
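The final combination step can be sketched as follows. The function name `index_digest` is hypothetical, the third confusion factor is modeled as an optional byte string, and BLAKE2b with an 8-byte output is again an assumed stand-in for the unspecified hash function.

```python
import hashlib


def index_digest(d_x: bytes, d_y: bytes, sigma: bytes = b"") -> bytes:
    """Hash the concatenation of the first digest, second digest and sigma."""
    combined = d_x + d_y + sigma  # first digest, then second digest, then sigma
    return hashlib.blake2b(combined, digest_size=8).digest()


# Placeholder 8-byte digests, used only to exercise the function.
example = index_digest(b"\x01" * 8, b"\x02" * 8, sigma=b"\x5a")
```

The confusion factor must be identical when building the malicious file library and when digesting files to be identified, otherwise matching files would produce different index digests.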

The process of identifying malicious files according to embodiments of the present disclosure is described below in connection with specific application scenarios.

In one embodiment, a user reports a file and requests identification of whether the file is a particular malicious file. To determine this, embodiments of the present disclosure perform the process shown in FIG. 4: the index digest of the file and the index digest of the malicious file are calculated and compared. If the two are the same, the file hits the malicious file, that is, the file is indeed the particular malicious file, and action is taken against it. If the two are different, the file does not hit the malicious file, that is, the file is not the particular malicious file, and the identification process ends.

The embodiment has the advantages that the identification is carried out aiming at the specific malicious files, and the identification speed and accuracy are improved.

In an embodiment, in a cloud searching scenario, it is required to determine whether a batch of files reported by a user contains malicious files. At this time, a malicious file library to be compared needs to be established in the cloud in advance, and then index abstract values of all files reported by a user are compared with the malicious file library, so that whether the files reported by the user contain malicious files is determined.

The method and the device have the advantages that files in the client are identified in batches, and the coverage rate of identification is improved.

In one embodiment, the process of creating a malicious file library is shown in FIG. 5: the identification server performs background analysis in advance (for example, analysis of the malicious behavior characteristics of the files) on a batch of files to be analyzed, and determines the malicious files among them. It then calculates the index digest of each malicious file and stores it in a cloud database, thereby establishing the malicious file library.

In an embodiment, the index abstract of the known malicious file may be calculated and stored in a malicious file library; the method can also open a malicious file reporting interface to a user, analyze and screen the malicious files reported by the user, calculate an index abstract of the determined malicious files and store the index abstract in a malicious file library; the method can also be used for screening the reported suspicious files based on the identification method/model of other malicious files, and storing the determined malicious files in a malicious file library after calculating an index abstract.

In one embodiment, the information of the malicious files stored in the malicious file repository mainly includes two parts, as shown in fig. 6: index digest, MD5. The index abstract is used as a key field (a field for searching) of a malicious file library and is used for comparing with the index abstract of the file to be identified; MD5 is used to find the original file of a malicious file after the index digest of the file to be identified hits the index digest of the malicious file.

In one embodiment, the information in the malicious file repository is stored using a key-value structure. The malicious file library stored by adopting the key-value structure can meet the query request of millions to tens of millions per second.

The embodiment has the advantage that the information of the malicious files is stored by adopting a key-value structure, so that the searching comparison of the malicious file library can be quickly performed later.

In an embodiment, various remark information of the malicious file, such as file description of the malicious file and malicious behavior characteristics of the malicious file, is also stored in the malicious file library.

The embodiment has the advantage that the remark information of the malicious file is stored, so that the malicious file can be further processed more pertinently after the malicious file is hit.

In one embodiment, as shown in FIG. 7: the malicious file library actually stores an index abstract value of 8 bytes in size and an MD5 of 16 bytes in size of the malicious file, which are separated by "|".
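A record with this layout can be sketched as follows. The sketch assumes that the digest and the MD5 are stored as raw bytes, which the text does not state, and the function names are hypothetical. The sample values are taken from the bot.exe example given later in this description.

```python
def pack_record(index_digest_hex: str, md5_hex: str) -> bytes:
    """Build an 8-byte index digest + "|" + 16-byte MD5 record."""
    digest = bytes.fromhex(index_digest_hex)
    md5 = bytes.fromhex(md5_hex)
    if len(digest) != 8 or len(md5) != 16:
        raise ValueError("expected an 8-byte digest and a 16-byte MD5")
    return digest + b"|" + md5


def unpack_record(record: bytes):
    """Split a record by fixed offsets and return (digest_hex, md5_hex)."""
    if len(record) != 25 or record[8:9] != b"|":
        raise ValueError("malformed record")
    return record[:8].hex(), record[9:].hex().upper()


record = pack_record("14a362fc29fc7483", "45F8D84CB112C8C7A862A773599C8358")
```

The parser uses fixed offsets rather than splitting on "|", because under the raw-byte assumption the byte 0x7C may also occur inside the digest or the MD5.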

In one embodiment, the malicious file library is deployed to the cloud after being established. After an identification system as an interface between an identification server and a client is installed on the client, the client is connected to the identification server, and enumeration of files of a specified path of the client is started. The identification system in the client can enumerate the files and calculate the index digests thereof, store the index digests in the index digest list, and report the index digests to the identification server. Fig. 8 shows part of the content of the index summary list in this embodiment.
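The client-side enumeration can be sketched as follows. The names are hypothetical, and `placeholder_index_digest` simply hashes the whole file with BLAKE2b as a stand-in; it is not the matrix-based index digest of the disclosure, which would be substituted in its place.

```python
import hashlib
import os


def placeholder_index_digest(path):
    """Stand-in digest (NOT the disclosed algorithm): 8-byte BLAKE2b as hex."""
    with open(path, "rb") as f:
        return hashlib.blake2b(f.read(), digest_size=8).hexdigest()


def build_index_digest_list(root):
    """Enumerate the files under root and return a list of (path, digest)."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                entries.append((path, placeholder_index_digest(path)))
            except OSError:
                continue  # unreadable files are skipped in this sketch
    return entries
```

Only the resulting list is reported to the identification server, so file contents never need to leave the client for files that do not hit the library.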

In the actual implementation process of the embodiment of the disclosure, since the file of the client will change at any time, the process of enumerating and calculating the index digest is continuous. In this process, the number of files may be thousands to hundreds of thousands, or even more, so that the content in the index summary list may be transmitted in batches, or may be transmitted centrally.
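Batched transmission of the index digest list can be sketched with a small generator. The name `batches` and the batch size are illustrative assumptions; the disclosure does not specify how batches are formed.

```python
from itertools import islice


def batches(entries, size):
    """Yield consecutive chunks of at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(entries)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Five digests sent in batches of two: the last batch is shorter.
chunks = list(batches(["d1", "d2", "d3", "d4", "d5"], 2))
```

Because the generator consumes its input lazily, it also works when enumeration is still in progress and the list is produced incrementally.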

In one embodiment, after receiving the index digest list, the identification server searches the malicious file library using each index digest in the list as the key. If an index digest in the list hits an index digest in the malicious file library, the file corresponding to that index digest is a malicious file, and the client is alerted to the presence of the malicious file so that further processing can be carried out.

In one embodiment, as shown in FIG. 9, the overall flow of identifying malicious files is as follows. Background analysis is performed on the files to be analyzed to determine the malicious files among them; the index digest of each malicious file is then calculated and the malicious file library is established. To identify whether files reported by a client are malicious, the index digests of the reported files are calculated and looked up in the malicious file library. Each hit indicates that the corresponding reported file is a malicious file, and action is taken against that file. If none of the index digests of the reported files is hit, there is no malicious file among the reported files, and the identification ends.

For example, a file bot.exe whose MD5 is 45F8D84CB112C8C7A862A773599C8358 is analyzed and determined to be a malicious file. Its index digest value is calculated as 14a362fc29fc7483, and the index digest value 14a362fc29fc7483 and the MD5 45F8D84CB112C8C7A862A773599C8358 are then saved in the malicious file library.

When the index digest lists reported by clients are compared entry by entry, the index digest value of a certain file in a list is found to be exactly 14a362fc29fc7483. Although its file name is system.exe, its index digest hits the entry 14a362fc29fc7483 in the malicious file library, so the identification server concludes that the client contains the malicious file whose MD5 is 45F8D84CB112C8C7A862A773599C8358, and that action needs to be taken against it.
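The lookup in this example can be sketched with an in-memory dictionary standing in for the key-value malicious file library. The digest and MD5 values come from the example above; the second reported entry (`notes.txt` and its digest) and the function name `find_hits` are hypothetical.

```python
# Key: index digest, value: MD5 of the malicious file (values from the text).
MALICIOUS_LIBRARY = {
    "14a362fc29fc7483": "45F8D84CB112C8C7A862A773599C8358",
}


def find_hits(index_digest_list, library):
    """Return (file name, MD5) for every reported digest found in the library."""
    return [(name, library[digest])
            for name, digest in index_digest_list
            if digest in library]


reported = [
    ("system.exe", "14a362fc29fc7483"),  # renamed copy of bot.exe
    ("notes.txt", "00000000000000aa"),   # hypothetical benign entry
]
hits = find_hits(reported, MALICIOUS_LIBRARY)
```

The file name plays no part in the lookup, which is why renaming bot.exe to system.exe does not prevent the hit.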

The communication process of identifying malicious files in the embodiment of the present disclosure is described in detail below with reference to fig. 10.

As shown in fig. 10: the entities involved in the embodiment of the disclosure include a client, an identification server and a malicious file library. When the malicious file is identified: the client installs the identification system according to the embodiment of the disclosure and logs in to the identification server; the client selects a scanning position to scan the files in the position, and calls the identification system to calculate index digests of the files while scanning; the client uploads the index abstract result to the identification server; the identification server compares the index abstract result with a malicious file library, determines whether a malicious file exists or not, and obtains a comparison result; if the index abstract does not exist in the malicious file library, the identification server requests the client to analyze the corresponding file, calculates the index abstract of the file, and judges whether the file is a malicious file or not; after the comparison of all index digests in the index digest results is finished, updating a malicious file library according to the identified malicious file results, and sending a prompt to a client to remind the client of processing the malicious file.

The method principles of embodiments of the present disclosure are described in detail below to more fully demonstrate internal details of embodiments of the present disclosure.

First, to ensure security, conventional malicious file identification uses cryptographic digest algorithms such as MD5 and the SHA family, which offer high security and are, in theory, infeasible to crack. However, these digest algorithms are computationally complex and expensive. In practical cloud searching applications, client files are about 1 MB to 500 MB in size and number from thousands to hundreds of thousands, so computing digests of the client's files with these cryptographic digest algorithms and identifying the files by those digests requires an extremely large amount of computation and places a heavy burden on the client.

If, instead of a high-security cryptographic digest algorithm, a general non-cryptographic digest algorithm (such as CRC32) is used, the digest is fast to compute, but because security was not a design consideration the digest is very easy to attack, so the approach has no practical value.

In one embodiment, as shown in FIG. 11, a general non-cryptographic digest algorithm (such as CRC32) is used to compute the digest of a file, denoted hash. If a malicious fragment is embedded in the file, the digest of the file changes to hash1. However, because the digest algorithm provides no security, an attacker can modify another insignificant place in the file (i.e., a place whose modification does not break the file) so that the digest hash2 computed after the modification equals the digest hash of the file without the malicious fragment, thereby bypassing identification.
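The weakness described above can be demonstrated with a short sketch that restores a CRC32 value by choosing four extra bytes. It relies on the known property that CRC32 is affine over GF(2), so the required bytes follow from Gaussian elimination on 32 bit vectors. The function name and the sample byte strings are hypothetical, and placing the patch at the end of the file is an assumption standing in for "another insignificant place".

```python
import zlib


def forge_crc32_suffix(prefix: bytes, target_crc: int) -> bytes:
    """Return 4 bytes s such that zlib.crc32(prefix + s) == target_crc."""
    state = zlib.crc32(prefix)
    base = zlib.crc32(b"\x00\x00\x00\x00", state)
    # Flipping suffix bit i always XORs the same 32-bit pattern into the CRC.
    basis = {}  # pivot bit -> (pattern, mask of suffix bits producing it)
    for i in range(32):
        suffix = (1 << i).to_bytes(4, "little")
        vec, mask = zlib.crc32(suffix, state) ^ base, 1 << i
        while vec:
            pivot = vec.bit_length() - 1
            if pivot not in basis:
                basis[pivot] = (vec, mask)
                break
            bvec, bmask = basis[pivot]
            vec, mask = vec ^ bvec, mask ^ bmask
    # Express the required CRC difference as an XOR of basis patterns.
    need, mask = target_crc ^ base, 0
    while need:
        bvec, bmask = basis[need.bit_length() - 1]
        need, mask = need ^ bvec, mask ^ bmask
    return mask.to_bytes(4, "little")


original = b"clean file contents"
tampered = b"clean file contents with an embedded malicious fragment"
forged = tampered + forge_crc32_suffix(tampered, zlib.crc32(original))
```

After the patch, `forged` has the same CRC32 as `original` even though its content differs, which is the bypass described for FIG. 11.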

Applying the embodiments of the present disclosure, by contrast, makes it extremely difficult for an attacker to bypass identification. As shown in FIG. 12, the data of a file is grouped to obtain the file element matrix of the file, the first digest and the second digest of the file are obtained on the basis of the file element matrix, and finally the index digest of the file is obtained.

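If an attacker modifies $a_{ij}$ in the matrix, the corresponding row vector $x_i$ and column vector $y_j$ change, and with them the index digest of the file. If the attacker wants to bypass identification by keeping the index digest of the file unchanged, at least three more places, namely $a_{ik}$, $a_{lj}$ and $a_{lk}$, must be modified accordingly so that $x_i$, $x_l$, $y_j$ and $y_k$, and hence the finally calculated index digest, are unchanged. The problem then becomes that of solving a system of XOR equations:

$$\begin{cases} a_{i1} \oplus \cdots \oplus a_{ij} \oplus \cdots \oplus a_{ik} \oplus \cdots \oplus a_{in} = x_i \\ a_{l1} \oplus \cdots \oplus a_{lj} \oplus \cdots \oplus a_{lk} \oplus \cdots \oplus a_{ln} = x_l \\ a_{1j} \oplus \cdots \oplus a_{ij} \oplus \cdots \oplus a_{lj} \oplus \cdots \oplus a_{mj} = y_j \\ a_{1k} \oplus \cdots \oplus a_{ik} \oplus \cdots \oplus a_{lk} \oplus \cdots \oplus a_{mk} = y_k \end{cases}$$

That is, given $a_{ij}$ and $x_i$, $x_l$, $y_j$, $y_k$, solve for $a_{ik}$, $a_{lj}$ and $a_{lk}$.

This system of XOR equations in three unknowns can be solved by Gaussian elimination. After elimination, the system has a unique solution when its equations are mutually consistent; otherwise it has no solution.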

Thus, if an attacker modifies one place in a file, at least three other places must be carefully constructed to keep the index digest value unchanged. However, since most of the bytes in a PE file are meaningful, modifying them tends to corrupt the file and render it inoperable. Thus, even if an attacker spends considerable resources modifying other data in the file so that the index digest value is unchanged, the file is likely to be corrupted and inoperable, and the malicious fragment embedded in it can no longer achieve its intended malicious purpose.
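The compensation an attacker would need can be checked with a small sketch. It assumes plain row and column XOR with no confusion factors, uses zero-based indices, and the function names and the 3×3 sample matrix are hypothetical. The same difference that is XORed into a_ij must also be XORed into a_ik, a_lj and a_lk (with l ≠ i and k ≠ j).

```python
from functools import reduce
from operator import xor


def row_col_xor(matrix):
    """Return the XOR of every row and the XOR of every column."""
    rows = [reduce(xor, row) for row in matrix]
    cols = [reduce(xor, col) for col in zip(*matrix)]
    return rows, cols


def compensate(matrix, i, j, l, k, new_value):
    """Set a_ij to new_value and XOR the same delta into a_ik, a_lj, a_lk."""
    delta = matrix[i][j] ^ new_value
    patched = [row[:] for row in matrix]
    for r, c in ((i, j), (i, k), (l, j), (l, k)):
        patched[r][c] ^= delta  # each affected row and column sees delta twice
    return patched


matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
patched = compensate(matrix, i=0, j=0, l=2, k=1, new_value=0xFF)
```

All row and column XOR values of `patched` equal those of `matrix`, but four elements had to change to achieve this, which in an executable file means overwriting three additional locations whose content the attacker cannot freely choose.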

In addition, multiple modifications tend to produce a system of equations that is harder to solve but that may admit multiple solutions. Confusion factors of corresponding strength can therefore be introduced, or multidimensional grouping performed, to meet the corresponding security requirements.

As shown in fig. 13, there is further provided an apparatus for identifying a malicious file according to an embodiment of the present disclosure, the apparatus including:

a grouping module 410, configured to group the data of the file to be identified according to a predetermined data grouping rule, wherein each group of data is taken as one element in the element matrix of the file to be identified to form the element matrix of the file to be identified;

the element set vector generating module 420 is configured to generate a first vector of each first type element set based on each first type element set in the element matrix of the file to be identified, and generate a second vector of each second type element set based on each second type element set in the element matrix of the file to be identified, where the first type element set and the second type element set are determined by dividing elements in the element matrix of the file to be identified according to a predetermined element dividing rule;

the first digest obtaining module 430 is configured to summarize the combined vector of the first vectors of each first class element set, so as to obtain a first digest of the file to be identified;

A second summary obtaining module 440, configured to summarize the combined vectors of the second vectors of each second class element set, so as to obtain a second summary of the file to be identified;

the index digest obtaining module 450 is configured to summarize the combined vector of the first digest and the second digest of the file to be identified, so as to obtain an index digest of the file to be identified;

the comparison module 460 is configured to compare the index digest of the file to be identified with the index digest of the malicious file in the preset malicious file library, so as to determine whether the file to be identified is a malicious file.

In an embodiment, the data of the file to be identified includes bytes in a binary representation of the file to be identified, the data packet of the file to be identified, each group of data divided into as one element in the element matrix of the file to be identified, includes:

In an embodiment, the data of the file to be identified includes characters of the file to be identified, the data of the file to be identified is grouped, and each group of data is divided into one element in the element matrix of the file to be identified, including:

sequentially taking out a predetermined number of characters from the file to be identified and converting them into a bit string as one element, and filling the elements into a blank matrix of a preset matrix width in top-to-bottom, left-to-right order.

In one embodiment, after sequentially taking out a predetermined number of characters from the file to be identified and converting the characters into bit strings as one element, filling the bit strings into blank matrixes with preset matrix widths in a sequence from top to bottom and from left to right, the method further comprises:

In an embodiment, after grouping the data of the file to be identified, each group of data is divided into one element in the element matrix of the file to be identified, and the element matrix of the file to be identified is formed, the method further includes:

Generating a row vector of a row based on elements of the row in the element matrix of the file to be identified, and generating a column vector of a column based on elements of the column in the element matrix of the file to be identified;

summarizing the combined vector of the first vectors of each first class element set to obtain a first summary of the file to be identified, wherein the summarizing comprises the following steps:

summarizing the combined vector of the row vectors of each row to obtain a first summary of the file to be identified;

summarizing the combined vector of the second vectors of each second class element set to obtain a second summary of the file to be identified, wherein the summarizing comprises the following steps:

In an embodiment, the generating a row vector of a row based on the element of the row in the file element matrix to be identified, and generating a column vector of a column based on the element of the column in the file element matrix to be identified includes:

In an embodiment, the combination vector of the row vectors of each row is a vector formed by cascading the row vectors of each row, and the combination vector of the column vectors of each column is a vector formed by cascading the column vectors of each column.

In an embodiment, the combined vector of the first digest and the second digest is a vector formed by sequentially concatenating the first digest and the second digest.

In one embodiment, the malicious file library stores the full text abstract of the malicious file corresponding to the index abstract, the full text abstract is obtained by applying a preset abstract algorithm to the full text of the malicious file, is used for searching the original file of the malicious file,

In an embodiment, the elements in the file element matrix to be identified and the elements in the malicious file element matrix contain data of the same size; the summarization algorithm for summarizing the combined vector of the first vectors of the first class element sets in the element matrix of the file to be identified is the same as the summarization algorithm for summarizing the combined vector of the first vectors of the first class element sets in the element matrix of the malicious file; the digest algorithm for summarizing the combined vector of the first digest and the second digest of the file to be identified is the same as the digest algorithm for summarizing the combined vector of the first digest and the second digest of the malicious file.

The method for identifying malicious files according to the embodiments of the present disclosure may be implemented by the client 300 and the identification server 200 shown in fig. 1 together. A terminal 500 for identifying a malicious file according to an embodiment of the present disclosure is described below with reference to fig. 14. The terminal for identifying malicious files shown in fig. 14 is only one example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.

As shown in fig. 14, the terminal that recognizes the malicious file is in the form of a general-purpose computing device. Components of the terminal that identify malicious files may include, but are not limited to: the at least one processing unit 510, the at least one memory unit 520, and a bus 530 connecting the various system components, including the memory unit 520 and the processing unit 510.

Wherein the storage unit stores program code that is executable by the processing unit 510 such that the processing unit 510 performs the steps according to various exemplary embodiments of the present invention described in the description of the exemplary methods described above in this specification. For example, the processing unit 510 may perform the various steps as shown in fig. 3.

The storage unit 520 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 5201 and/or cache memory unit 5202, and may further include Read Only Memory (ROM) 5203.

The storage unit 520 may also include a program/utility 5204 having a set (at least one) of program modules 5205, such program modules 5205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.

Bus 530 may be one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.

The malicious file-identifying terminal may also communicate with one or more external devices 600 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the malicious file-identifying terminal, and/or with any device (e.g., router, modem, etc.) that enables the malicious file-identifying terminal to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 550. Also, a terminal that identifies malicious files may also communicate with one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet, through network adapter 560. As shown, network adapter 560 communicates with other modules of the terminal that identify malicious files via bus 530. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in connection with a terminal that identifies malicious files, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.

From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, there is also provided a computer program medium having stored thereon computer readable instructions, which when executed by a processor of a computer, cause the computer to perform the method described in the method embodiment section above.

According to an embodiment of the present disclosure, there is also provided a program product for implementing the method in the above method embodiments, which may employ a portable compact disc read only memory (CD-ROM) and comprise program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., connected over the internet using an internet service provider).

It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.

Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.

From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims

1. A method of identifying malicious files, the method comprising:

comparing the index abstract of the file to be identified with the index abstract of the malicious file in a preset malicious file library, so as to determine whether the file to be identified is a malicious file; the method for calculating the index abstract of the malicious file in the preset malicious file library is the same as the method for calculating the index abstract of the file to be identified.

2. The method according to claim 1, wherein the index digest of the malicious file in the preset malicious file repository is obtained in advance by:

3. The method according to claim 1, wherein the data of the file to be identified comprises bytes in a binary representation of the file to be identified, the data of the file to be identified is grouped according to a predetermined data grouping rule, each group of data divided as one element in a matrix of file elements to be identified, and the matrix of file elements to be identified is composed, comprising:

4. The method according to claim 1, wherein the data of the file to be identified includes characters of the file to be identified, the data of the file to be identified is grouped according to a predetermined data grouping rule, each group of data is divided into one element of a file element matrix to be identified, and the element matrix to be identified is formed by using each group of data as one element of the file element matrix to be identified, including:

5. The method according to claim 3 or 4, wherein after filling the blank matrices of the predetermined matrix width in a top-down, left-to-right order, the method further comprises:

6. The method according to claim 1, wherein after grouping data of the file to be identified according to a predetermined data grouping rule, each group of data divided as one element of the file element matrix to be identified, constitutes the file element matrix to be identified, the method further comprises:

7. The method of claim 1, wherein determining each row of elements in the matrix of file elements to be identified as a corresponding first set of elements and each column of elements in the matrix of file elements to be identified as a corresponding second set of elements according to a predetermined element partitioning rule, wherein generating a first vector for the first set of elements based on each first set of elements in the matrix of file elements to be identified, and generating a second vector for the second set of elements based on each second set of elements in the matrix of file elements to be identified, comprises:

8. The method of claim 7, wherein generating the row vector of a row based on the elements of that row in the file element matrix to be identified, and generating the column vector of a column based on the elements of that column in the file element matrix to be identified, comprises:

9. The method of claim 7, wherein the combined vector of the row vectors is a vector obtained by concatenating the row vectors of the rows, and the combined vector of the column vectors is a vector obtained by concatenating the column vectors of the columns.
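As an illustration only, the concatenation recited in claim 9 can be sketched as follows. The sketch assumes that each matrix element is a `bytes` value and that the vector of a row or column is simply its elements joined in order; the claims do not fix how a row or column vector is generated, and the name `combined_vectors` is hypothetical.

```python
from typing import List, Tuple

# Assumption: the file element matrix is a rectangular list of rows,
# and every element is a bytes value of the same size.
Matrix = List[List[bytes]]


def combined_vectors(matrix: Matrix) -> Tuple[bytes, bytes]:
    """Return (combined row vector, combined column vector)."""
    # Assumption: a row vector is the row's elements joined in order.
    row_vectors = [b"".join(row) for row in matrix]
    # zip(*matrix) walks the matrix column by column.
    col_vectors = [b"".join(col) for col in zip(*matrix)]
    # Each combined vector is the concatenation of the per-row
    # (or per-column) vectors.
    return b"".join(row_vectors), b"".join(col_vectors)


matrix = [[b"a", b"b"],
          [b"c", b"d"]]
row_combined, col_combined = combined_vectors(matrix)
```

With this 2 x 2 example the rows are read as `ab`, `cd` and the columns as `ac`, `bd`, so the two combined vectors differ even though they contain the same elements.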

10. The method of claim 1, wherein the combined vector of the first digest and the second digest is a vector in which the first digest and the second digest are sequentially concatenated.
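A minimal sketch of the sequential concatenation in claim 10 is given below. SHA-256 from Python's `hashlib` stands in for the predetermined digest algorithm, which the claims do not name; the function name `index_digest` is likewise an assumption made for illustration.

```python
import hashlib


def index_digest(first_digest: bytes, second_digest: bytes) -> bytes:
    """Digest the first and second digests concatenated in that order."""
    # Assumption: SHA-256 is used as the digest algorithm.
    combined = first_digest + second_digest
    return hashlib.sha256(combined).digest()


# Placeholder inputs standing in for the first and second digests.
first = hashlib.sha256(b"combined row vector").digest()
second = hashlib.sha256(b"combined column vector").digest()
result = index_digest(first, second)
```

Because the concatenation is sequential, swapping the two digests yields a different index digest, so the same order has to be used for every file that is compared.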

11. The method of claim 1, wherein the malicious file library stores, in correspondence with the index digest, a full text digest of the malicious file, the full text digest being obtained by applying a predetermined digest algorithm to the full text of the malicious file and being used to look up the original file of the malicious file,

12. The method of claim 2, wherein the elements in the file element matrix to be identified and the elements in the malicious file element matrix contain data of the same size; the digest algorithm applied to the combined vector of the first vectors of the first type element sets in the file element matrix to be identified is the same as the digest algorithm applied to the combined vector of the first vectors of the first type element sets in the malicious file element matrix; and the digest algorithm applied to the combined vector of the first digest and the second digest of the file to be identified is the same as the digest algorithm applied to the combined vector of the first digest and the second digest of the malicious file.

13. An apparatus for identifying malicious files, the apparatus comprising:

a comparison module configured to compare the index digest of the file to be identified with the index digest of a malicious file in a preset malicious file library, so as to determine whether the file to be identified is a malicious file, wherein the index digest of the malicious file in the preset malicious file library is calculated by the same method as the index digest of the file to be identified.
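The comparison step can be sketched as a lookup in a set of library digests. The sketch assumes that "comparing" means testing for an exact match, and it uses a plain SHA-256 of the file bytes as a placeholder for the index digest calculation; `toy_index_digest`, `build_library` and `is_malicious` are hypothetical names. The point illustrated is only that the same digest function is applied on both sides.

```python
import hashlib
from typing import Callable, Iterable, Set

DigestFn = Callable[[bytes], str]


def toy_index_digest(data: bytes) -> str:
    # Placeholder only: a real index digest would be derived from the
    # file element matrix, not from a single hash of the whole file.
    return hashlib.sha256(data).hexdigest()


def build_library(malicious_files: Iterable[bytes],
                  digest_fn: DigestFn) -> Set[str]:
    """Pre-compute the index digests of known malicious files."""
    return {digest_fn(f) for f in malicious_files}


def is_malicious(candidate: bytes, library: Set[str],
                 digest_fn: DigestFn) -> bool:
    """Apply the same digest function to the candidate, then look it up."""
    # Assumption: a file is flagged when its digest matches exactly.
    return digest_fn(candidate) in library


library = build_library([b"known bad sample"], toy_index_digest)
flagged = is_malicious(b"known bad sample", library, toy_index_digest)
clean = is_malicious(b"harmless sample", library, toy_index_digest)
```

Passing the digest function as a parameter keeps the library and the file to be identified on the same calculation method, which is the condition stated at the end of the comparison module.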

14. An electronic device for identifying malicious files, comprising:

a memory storing computer readable instructions;

a processor configured to read the computer readable instructions stored in the memory to perform the method of any one of claims 1-12.

15. A computer program medium having computer readable instructions stored thereon, which, when executed by a processor of a computer, cause the computer to perform the method of any of claims 1-12.