CN112445760B

CN112445760B - File classification method, device, storage medium and apparatus

Info

Publication number: CN112445760B
Application number: CN202011275211.8A
Authority: CN
Inventors: 徐传宇; 党亮; 王士聪
Original assignee: 360 Digital Security Technology Group Co Ltd
Current assignee: 360 Digital Security Technology Group Co Ltd
Priority date: 2020-11-13
Filing date: 2020-11-13
Publication date: 2024-05-14
Anticipated expiration: 2040-11-13
Also published as: CN112445760A

Abstract

The invention discloses a file classification method, device, storage medium and apparatus. The method comprises: acquiring a file to be processed and file information of the file to be processed; extracting character information from the file to be processed to obtain structural character information of the file to be processed; determining a file index value according to the structural character information and the file information; and classifying the file to be processed according to the file index value. In the existing approach, the code features of a sample file are analyzed manually in order to classify it. The invention instead determines the file index value from the file information and the structural character information of the file to be processed and classifies the file according to that value, which overcomes the low efficiency and poor reliability of file classification in the prior art, optimizes the file classification process, improves classification efficiency, and ensures classification reliability.

Description

File classification method, device, storage medium and apparatus

Technical Field

The present invention relates to the field of internet technologies, and in particular, to a method, an apparatus, a storage medium, and a device for classifying files.

Background

Currently, when a user performs analysis on a sample file, the sample file is generally downloaded to a local computer, and then code features of the sample file are manually analyzed to classify the sample file.

However, the above method requires manual analysis of the sample file, which results in low file classification efficiency and poor reliability.

The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present invention and is not intended to represent an admission that the foregoing is prior art.

Disclosure of Invention

The main object of the present invention is to provide a file classification method, device, storage medium and apparatus that solve the technical problem of how to optimize the file classification process.

In order to achieve the above object, the present invention provides a file classification method, comprising the steps of:

Acquiring a file to be processed and file information of the file to be processed;

extracting character information from the file to be processed to obtain structural character information of the file to be processed;

and determining a file index value according to the structural character information and the file information, and classifying the files to be processed according to the file index value.

Optionally, the step of determining a file index value according to the structural character information and the file information and classifying the file to be processed according to the file index value specifically includes:

Acquiring entry data of the file to be processed, and generating a primary index value according to the entry data and the structural character information;

Generating a secondary index value of the file to be processed according to the file information;

Generating a file index value according to the primary index value and the secondary index value, and classifying the files to be processed according to the file index value.
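As a minimal sketch of how a primary and a secondary index value could be merged into one file index value, the following assumes both are non-negative integers and joins them into a fixed-width hexadecimal key. The function name `file_index_value` and the field widths are hypothetical; the text does not specify an encoding.

```python
def file_index_value(primary: int, secondary: int) -> str:
    """Join the two index levels into one fixed-width hexadecimal key.

    The 8-digit and 4-digit widths are illustrative assumptions only.
    """
    if primary < 0 or secondary < 0:
        raise ValueError("index values must be non-negative")
    return f"{primary:08x}-{secondary:04x}"


if __name__ == "__main__":
    # Two files with the same key would fall into the same class.
    print(file_index_value(0x1A2B, 0x3))
```

A fixed-width key keeps keys directly comparable as strings, which is convenient when files are later matched by index value.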

Optionally, the step of obtaining the entry data of the file to be processed and generating a primary index value according to the entry data and the structural character information specifically includes:

determining signature characters, flag characters and file attribute information according to the structural character information;

determining an attribute data index value of the file to be processed according to the signature character, the flag character and the file attribute information;

acquiring the entry data of the file to be processed, and searching an entry data index value corresponding to the entry data;

and generating a primary index value according to the attribute data index value and the entry data index value.

Optionally, the step of determining signature characters, flag characters and file attribute information according to the structural character information specifically includes:

determining header character position information, flag characters and file attribute information according to the structure character information;

And determining signature character position information according to the head character position information, and determining signature characters according to the signature character position information.
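For a PE file, the two steps above can be sketched as follows. The sketch assumes the header characters are the `MZ` DOS header, whose 4-byte little-endian field at offset 0x3C (`e_lfanew` in the PE format) gives the position of the PE signature. The helper names `read_signature` and `make_stub` are hypothetical, and `make_stub` only builds synthetic test data.

```python
import struct

DOS_MAGIC = b"MZ"             # header characters at offset 0
E_LFANEW_OFFSET = 0x3C        # DOS header field holding the signature position
PE_SIGNATURE = b"PE\x00\x00"


def read_signature(data: bytes):
    """Return (signature_offset, signature_bytes), or None if the header is unusable."""
    if len(data) < E_LFANEW_OFFSET + 4 or data[:2] != DOS_MAGIC:
        return None
    (sig_offset,) = struct.unpack_from("<I", data, E_LFANEW_OFFSET)
    if sig_offset + 4 > len(data):
        return None
    return sig_offset, data[sig_offset:sig_offset + 4]


def make_stub(sig_offset: int = 0x80) -> bytes:
    """Build a tiny synthetic buffer with an MZ header and a PE signature."""
    buf = bytearray(sig_offset + 4)
    buf[0:2] = DOS_MAGIC
    struct.pack_into("<I", buf, E_LFANEW_OFFSET, sig_offset)
    buf[sig_offset:sig_offset + 4] = PE_SIGNATURE
    return bytes(buf)


if __name__ == "__main__":
    print(read_signature(make_stub()))
```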

Optionally, the step of determining the attribute data index value of the file to be processed according to the signature character, the flag character and the file attribute information specifically includes:

judging whether the file to be processed is a legal file according to the signature character, and obtaining a file judgment result;

determining a flag field according to the flag character, and determining file bit number information of a file to be processed according to the flag field;

and determining the attribute data index value of the file to be processed according to the file judgment result, the file bit number information and the file attribute information.
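The legality check and the bit number lookup can be illustrated with a short sketch. It assumes a PE file in which the optional header starts right after the 4-byte signature and the 20-byte IMAGE_FILE_HEADER, and in which the Magic field is 0x10B for 32-bit (PE32) and 0x20B for 64-bit (PE32+) images. These values come from the PE format rather than from the text above, and `judge_and_bits` is a hypothetical name.

```python
import struct

PE_SIGNATURE = b"PE\x00\x00"
FILE_HEADER_SIZE = 20                      # size of IMAGE_FILE_HEADER
MAGIC_TO_BITS = {0x10B: 32, 0x20B: 64}     # PE32 and PE32+ Magic values


def judge_and_bits(data: bytes, sig_offset: int):
    """Return (is_legal, bits); bits is None if Magic is missing or unknown."""
    is_legal = data[sig_offset:sig_offset + 4] == PE_SIGNATURE
    magic_offset = sig_offset + 4 + FILE_HEADER_SIZE
    if not is_legal or magic_offset + 2 > len(data):
        return is_legal, None
    (magic,) = struct.unpack_from("<H", data, magic_offset)
    return True, MAGIC_TO_BITS.get(magic)


if __name__ == "__main__":
    sample = PE_SIGNATURE + bytes(FILE_HEADER_SIZE) + struct.pack("<H", 0x20B)
    print(judge_and_bits(sample, 0))
```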

Optionally, the step of determining the attribute data index value of the file to be processed according to the file judgment result, the file bit number information and the file attribute information specifically includes:

Extracting information from the file attribute information to obtain basic information, platform information, debugging information, resource information, relocation table information, export table information, version information, program execution entry information and section table information;

Performing information fusion on the platform information, the debugging information, the resource information, the relocation table information and the export table information to obtain fusion information;

And determining a file attribute index value according to the basic information, the fusion information, the version information, the program execution entry information and the section table information.
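The text does not say how the five pieces of information are fused. One plausible reading, sketched here purely as an assumption, is to record whether each item is present and pack those flags into a single integer. The field order in `FUSION_FIELDS` and the function name `fuse` are hypothetical.

```python
# Hypothetical bit positions; the source does not specify an encoding.
FUSION_FIELDS = ("platform", "debug", "resource", "relocation", "export")


def fuse(presence: dict) -> int:
    """Pack five presence flags into one integer, one bit per field."""
    value = 0
    for bit, name in enumerate(FUSION_FIELDS):
        if presence.get(name, False):
            value |= 1 << bit
    return value


if __name__ == "__main__":
    # A file with debug information and an export table, nothing else.
    print(fuse({"debug": True, "export": True}))
```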

Optionally, the step of determining a file attribute index value according to the basic information, the fusion information, the version information, the program execution entry information and the section table information specifically includes:

Classifying the files to be processed according to the basic information, and determining a basic information index value according to a classification result;

determining fusion data according to the fusion information, and determining a fusion information index value according to the fusion data;

Determining a version index value according to the version information;

Determining a program entry category according to the program execution entry information, and determining a program entry index value according to the program entry category;

extracting features of the section table information to obtain section table feature information, and determining a section table index value according to the section table feature information;

and determining a file attribute index value according to the basic information index value, the fusion information index value, the version index value, the program entry index value and the section table index value.

Optionally, the step of determining the version index value according to the version information specifically includes:

information screening is carried out on the version information to obtain major version number information and minor version number information;

and generating a version index value according to the major version number information and the minor version number information.
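A minimal sketch of one way to build a version index value, assuming the major and minor version numbers are each a single byte (as the linker version fields are in the PE optional header). The packing scheme and the name `version_index` are illustrative assumptions.

```python
def version_index(major: int, minor: int) -> int:
    """Pack two one-byte version numbers into a single 16-bit value."""
    if not (0 <= major <= 0xFF and 0 <= minor <= 0xFF):
        raise ValueError("version numbers must fit in one byte")
    return (major << 8) | minor


if __name__ == "__main__":
    print(hex(version_index(14, 29)))
```

Placing the major number in the high byte means that ordering by index value also orders by version.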

Optionally, the step of generating the secondary index value of the file to be processed according to the file information specifically includes:

extracting information from the file information to obtain file import table information and export information;

Determining a file import table index value according to the file import table information;

determining a data classification index value according to the export information and the resource information;

and generating a secondary index value of the file to be processed according to the data classification index value and the resource judgment index value.

Optionally, the step of determining the data classification index value according to the export information and the resource information specifically includes:

judging whether the file to be processed contains an exported function according to the export information, and obtaining a function judgment result;

Judging whether the file to be processed contains resource data according to the resource information, and obtaining a resource judgment result;

and determining a data classification index value according to the function judgment result and the resource judgment result.
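The two judgments can be sketched as a two-bit code. The sketch assumes that "contains" is decided by a non-zero data directory size, which is an assumption rather than something the text states, and `data_classification_index` is a hypothetical name.

```python
def data_classification_index(export_size: int, resource_size: int) -> int:
    """Combine the two judgment results into a value from 0 to 3.

    Bit 1 is set when an export table is present, bit 0 when resources are.
    """
    has_export = export_size > 0        # function judgment result
    has_resource = resource_size > 0    # resource judgment result
    return (int(has_export) << 1) | int(has_resource)


if __name__ == "__main__":
    print(data_classification_index(120, 0))
```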

Optionally, the step of generating a file index value according to the primary index value and the secondary index value, and classifying the file to be processed according to the file index value specifically includes:

Generating a file index value according to the primary index value and the secondary index value;

Traversing the file to be processed, and taking the traversed file to be processed as a current file;

Taking the files to be processed except the current file as files to be matched, and matching the file index value of the current file with the file index value of the files to be matched to obtain a matching result;

And after the traversal of the files to be processed is finished, classifying the files to be processed according to the matching result.
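If matching is taken to mean exact equality of file index values, which is an assumption, comparing every file against every other file gives the same classes as grouping the files by index value. The sketch below uses that shortcut; `classify_by_index` and the sample file names are hypothetical.

```python
from collections import defaultdict


def classify_by_index(index_values: dict) -> list:
    """Group file names whose file index values are equal.

    index_values maps a file name to its file index value.
    """
    groups = defaultdict(list)
    for name, value in index_values.items():   # traverse the files to be processed
        groups[value].append(name)             # files with equal values match
    return [sorted(members) for members in groups.values()]


if __name__ == "__main__":
    print(classify_by_index({"a.exe": "01ab", "b.dll": "77ff", "c.exe": "01ab"}))
```

Grouping through a dictionary avoids the quadratic cost of pairwise comparison when many files are processed.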

Optionally, before the step of extracting the character information of the file to be processed to obtain the structural character information of the file to be processed, the file classification method further includes:

extracting the characteristics of the file to be processed to obtain the file characteristics of the file to be processed;

matching the file characteristics with sample characteristics in a preset virus library to obtain a matching result;

Correspondingly, the step of extracting the character information of the file to be processed to obtain the structural character information of the file to be processed specifically comprises the following steps:

And when the matching result is that the matching fails, extracting character information from the file to be processed to obtain the structural character information of the file to be processed.

In addition, in order to achieve the above object, the present invention also proposes a file classification device including a memory, a processor, and a file classification program stored on the memory and executable on the processor, the file classification program being configured to implement the steps of the file classification method as described above.

In addition, in order to achieve the above object, the present invention also proposes a storage medium having stored thereon a file classification program which, when executed by a processor, implements the steps of the file classification method as described above.

In addition, in order to achieve the above object, the present invention also provides a file classification apparatus, including: an acquisition module, an extraction module and a classification module;

the acquisition module is used for acquiring a file to be processed and file information of the file to be processed;

the extraction module is used for extracting character information of the file to be processed and obtaining structural character information of the file to be processed;

and the classification module is used for determining a file index value according to the structural character information and the file information and classifying the file to be processed according to the file index value.

Optionally, the classification module is further configured to obtain entry data of the file to be processed, and generate a primary index value according to the entry data and the structural character information;

the classification module is further used for generating a secondary index value of the file to be processed according to the file information;

The classification module is further configured to generate a file index value according to the primary index value and the secondary index value, and classify the file to be processed according to the file index value.

Optionally, the classification module is further configured to determine signature characters, flag characters and file attribute information according to the structural character information;

the classification module is further configured to determine an attribute data index value of the file to be processed according to the signature character, the flag character and the file attribute information;

The classification module is further used for acquiring the entry data of the file to be processed and searching an entry data index value corresponding to the entry data;

the classification module is further configured to generate a primary index value according to the attribute data index value and the entry data index value.

Optionally, the classification module is further configured to determine header character position information, flag characters, and file attribute information according to the structural character information;

The classification module is further configured to determine signature character position information according to the header character position information, and determine signature characters according to the signature character position information.

Optionally, the classification module is further configured to determine whether the file to be processed is a legal file according to the signature character, so as to obtain a file determination result;

The classification module is further used for determining a flag field according to the flag character and determining file bit number information of the file to be processed according to the flag field;

The classification module is further configured to determine an attribute data index value of the file to be processed according to the file judgment result, the file bit number information and the file attribute information.

Optionally, the classification module is further configured to extract information from the attribute information to obtain basic information, platform information, debug information, resource information, relocation table information, export table information, version information, program execution entry information, and section table information;

the classification module is further configured to perform information fusion on the platform information, the debug information, the resource information, the relocation table information and the export table information, so as to obtain fusion information;

the classification module is further configured to determine a file attribute index value according to the basic information, the fusion information, the version information, the program execution entry information, and the section table information.

In the present invention, a file to be processed and file information of the file to be processed are acquired; character information is extracted from the file to be processed to obtain structural character information of the file to be processed; a file index value is determined according to the structural character information and the file information; and the file to be processed is classified according to the file index value. In the existing approach, the code features of a sample file are analyzed manually in order to classify it. The present invention instead determines the file index value from the file information and the structural character information of the file to be processed and classifies the file according to that value, which overcomes the low efficiency and poor reliability of file classification in the prior art, optimizes the file classification process, improves classification efficiency, and ensures classification reliability.

Drawings

FIG. 1 is a schematic structural diagram of a file classification device in a hardware operating environment according to an embodiment of the present invention;

FIG. 2 is a flowchart of a first embodiment of a file classification method according to the present invention;

FIG. 3 is a flowchart of a second embodiment of a file classification method according to the present invention;

FIG. 4 is a flowchart of a third embodiment of a file classification method according to the present invention;

FIG. 5 is a block diagram showing a first embodiment of the file classification apparatus according to the present invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Referring to fig. 1, fig. 1 is a schematic diagram of a file classification device of a hardware running environment according to an embodiment of the present invention.

As shown in FIG. 1, the file classification device may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface, and in the present invention the wired interface of the user interface 1003 may be a USB interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a non-volatile memory (NVM), such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.

It will be appreciated by those skilled in the art that the structure shown in FIG. 1 does not limit the file classification device, which may include more or fewer components than shown, may combine certain components, or may use a different arrangement of components.

As shown in FIG. 1, the memory 1005, which is regarded as a type of computer storage medium, may include an operating system, a network communication module, a user interface module, and a file classification program.

In the file classification device shown in FIG. 1, the network interface 1004 is mainly used for connecting to a background server and performing data communication with the background server; the user interface 1003 is mainly used for connecting user equipment; the file classification device calls the file classification program stored in the memory 1005 through the processor 1001 and performs the file classification method provided by the embodiment of the present invention.

Based on the above hardware structure, an embodiment of the file classification method of the present invention is provided.

Referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of a file classification method according to the present invention.

In a first embodiment, the file classification method includes the steps of:

Step S10: acquiring the file to be processed and file information of the file to be processed.

It should be understood that the execution body of the embodiment is the file classification device, where the file classification device may be an electronic device such as a personal computer or a server, or may be other devices that may implement the same or similar functions, and this embodiment is not limited thereto.

It should be noted that the file to be processed may be a sample file input by the user through the file classification device, or a sample file input by the user through a terminal device that has established a communication connection with the file classification device in advance. The sample file may be a file designated in advance by the user for feature extraction, which is not limited in this embodiment. The file information may include file import table information, file export information, file resource information, and the like, which is not limited in this embodiment.

Further, in order to avoid repeated analysis of the file to be processed, to improve the processing efficiency, before the file to be processed and the file information of the file to be processed are obtained, the method further includes:

Extracting features from the file to be processed to obtain the file features of the file to be processed; matching the file features with sample features in a preset virus library to obtain a matching result; and, when the matching result is a matching failure, extracting character information from the file to be processed to obtain the structural character information of the file to be processed.

Step S20: extracting character information from the file to be processed to obtain structural character information of the file to be processed.

It should be noted that, the structure character information may include header characters, signature characters, flag characters, basic information, debug information, resource information, relocation table information, export table information, version information, program execution entry information, section table information, and the like, which is not limited in this embodiment.

The signature character may be the PE signature and the flag character may be the Magic field; the file attribute information may include basic information, debug information, resource information, relocation table information, export table information, version information, program execution entry information, and section table information, which is not limited in this embodiment.

The basic information may be the Characteristics field of IMAGE_FILE_HEADER;

The platform information may be the IMAGE_DIRECTORY_ENTRY_COM_DESCRIPTOR data directory pointer;

The debug information may be the IMAGE_DIRECTORY_ENTRY_DEBUG data directory;

The resource information may be the IMAGE_DIRECTORY_ENTRY_RESOURCE data directory;

The relocation table information may be the IMAGE_DIRECTORY_ENTRY_BASERELOC data directory;

The export table information may be the IMAGE_DIRECTORY_ENTRY_EXPORT data directory;

The version information may be the values of IMAGE_OPTIONAL_HEADER::MajorLinkerVersion and ::MinorLinkerVersion;

The program execution entry information may be the entry point information obtained through IMAGE_OPTIONAL_HEADER::AddressOfEntryPoint;

The section table information may be the information in IMAGE_SECTION_HEADER.
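To make the data directory items concrete, the sketch below reads the export, resource, relocation, debug and COM descriptor entries from a PE optional header. The directory indices (0, 2, 5, 6, 14) and the header offsets (directories at offset 96 for PE32 and 112 for PE32+, each entry an 8-byte RVA and size pair) follow the PE format. The function name `read_directories` and the dictionary keys are hypothetical, and bounds checking is left out for brevity.

```python
import struct

FILE_HEADER_SIZE = 20          # size of IMAGE_FILE_HEADER
DIRECTORY_INDEX = {            # data directory indices defined by the PE format
    "export": 0, "resource": 2, "relocation": 5, "debug": 6, "platform": 14,
}


def read_directories(data: bytes, sig_offset: int) -> dict:
    """Return {name: (rva, size)} for the directories in DIRECTORY_INDEX."""
    opt = sig_offset + 4 + FILE_HEADER_SIZE             # optional header start
    (magic,) = struct.unpack_from("<H", data, opt)
    dir_base = opt + (112 if magic == 0x20B else 96)    # PE32+ or PE32 layout
    (count,) = struct.unpack_from("<I", data, dir_base - 4)  # NumberOfRvaAndSizes
    result = {}
    for name, idx in DIRECTORY_INDEX.items():
        if idx < count:
            result[name] = struct.unpack_from("<II", data, dir_base + 8 * idx)
        else:
            result[name] = (0, 0)                       # entry not present
    return result


def make_pe32_stub() -> bytes:
    """Synthetic PE32 headers with only a resource directory filled in."""
    buf = bytearray(4 + FILE_HEADER_SIZE + 96 + 16 * 8)
    buf[0:4] = b"PE\x00\x00"
    struct.pack_into("<H", buf, 24, 0x10B)              # Magic: PE32
    struct.pack_into("<I", buf, 24 + 92, 16)            # NumberOfRvaAndSizes
    struct.pack_into("<II", buf, 24 + 96 + 8 * 2, 0x3000, 0x500)
    return bytes(buf)


if __name__ == "__main__":
    print(read_directories(make_pe32_stub(), 0))
```

A directory whose size is zero can be treated as absent, which is one way to feed later judgments such as whether the file carries resources or exports.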

Step S30: determining a file index value according to the structural character information and the file information, and classifying the files to be processed according to the file index value.

It should be understood that determining the file index value according to the structural character information and the file information may be to obtain the file index value by processing the structural character information and the file information through a preset file index value script.

Further, in order to improve the file classification efficiency, the determining a file index value according to the structural character information and the file information, and classifying the file to be processed according to the file index value includes:

Obtaining the entry data of the file to be processed, generating a primary index value according to the entry data and the structural character information, generating a secondary index value of the file to be processed according to the file information, generating a file index value according to the primary index value and the secondary index value, and classifying the file to be processed according to the file index value.

Furthermore, in order to generate the primary index value quickly and accurately, to improve the processing efficiency, the obtaining the entry data of the file to be processed, and generating the primary index value according to the entry data and the structural character information includes:

Determining signature characters, flag characters and file attribute information according to the structural character information, determining attribute data index values of the files to be processed according to the signature characters, the flag characters and the file attribute information, acquiring entry data of the files to be processed, searching entry data index values corresponding to the entry data, and generating primary index values according to the attribute data index values and the entry data index values.

Furthermore, in order to generate the secondary index value rapidly and accurately, to improve the processing efficiency, the generating the secondary index value of the file to be processed according to the file information includes:

And extracting information from the file information to obtain file import table information and export information, determining a file import table index value according to the file import table information, determining a data classification index value according to the export information and the resource information, and generating a secondary index value of the file to be processed according to the data classification index value and the resource judgment index value.

Further, in order to improve reliability of file classification, the generating a file index value according to the primary index value and the secondary index value, and classifying the file to be processed according to the file index value includes:

generating a file index value according to the primary index value and the secondary index value, traversing the file to be processed, taking the traversed file to be processed as a current file, taking the files to be processed except the current file as files to be matched, matching the file index value of the current file with the file index value of the files to be matched to obtain a matching result, and classifying the files to be processed according to the matching result after the traversing of the files to be processed is finished.

In the first embodiment, a file to be processed and file information of the file to be processed are acquired; character information is extracted from the file to be processed to obtain structural character information of the file to be processed; a file index value is determined according to the structural character information and the file information; and the file to be processed is classified according to the file index value. In the existing approach, the code features of a sample file are analyzed manually in order to classify it. This embodiment instead determines the file index value from the file information and the structural character information of the file to be processed and classifies the file according to that value, which overcomes the low efficiency and poor reliability of file classification in the prior art, optimizes the file classification process, improves classification efficiency, and ensures classification reliability.

Referring to fig. 3, fig. 3 is a flowchart illustrating a second embodiment of the file classification method according to the present invention, and based on the first embodiment shown in fig. 2, the second embodiment of the file classification method according to the present invention is provided.

In a second embodiment, before the step S20, the method further includes:

Step S110: extracting features from the file to be processed to obtain the file features of the file to be processed.

It should be understood that feature extraction may proceed in two stages: feature identifiers are first extracted from the file to be processed to obtain feature identification information, and features are then extracted from the file according to the feature identification information to obtain the file features of the file to be processed. The feature identification information may be any information that identifies a feature, which is not limited in this embodiment.

Step S120: matching the file features with sample features in a preset virus library to obtain a matching result.

It should be noted that the preset virus library may be a computer virus library preset by a user; the sample feature may be a virus sample feature stored in advance in a preset virus library by the user, which is not limited in this embodiment.

It should be understood that matching the file features with the sample features in the preset virus library may proceed as follows: the file features are matched with the sample features to obtain a feature matching degree, and the feature matching degree is compared with a preset matching degree. When the feature matching degree is greater than the preset matching degree, the matching result is a matching success; when the feature matching degree is less than or equal to the preset matching degree, the matching result is a matching failure. The preset matching degree may be a matching value preset by the user, which is not limited in this embodiment.
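The threshold comparison can be sketched as follows. The text does not define how the feature matching degree is computed, so the sketch swaps in Jaccard similarity between feature sets as a stand-in. The function name `match_against_library`, the default threshold of 0.8 and the sample data are hypothetical.

```python
def match_against_library(features: set, library: list, threshold: float = 0.8) -> bool:
    """Return True (matching success) if any sample exceeds the threshold.

    The matching degree here is Jaccard similarity, an illustrative choice.
    """
    for sample in library:
        union = features | sample
        degree = len(features & sample) / len(union) if union else 0.0
        if degree > threshold:      # strictly greater, as in the text
            return True
    return False                    # matching failure: continue with analysis


if __name__ == "__main__":
    library = [{"a", "b", "c", "d"}]
    print(match_against_library({"a", "b", "c", "d"}, library))
```

A degree exactly equal to the threshold counts as a failure, matching the "less than or equal" branch described above.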

Accordingly, the step S20 includes:

Step S20': when the matching result is a matching failure, extracting character information from the file to be processed to obtain the structural character information of the file to be processed.

It should be understood that when the matching result is that the matching fails, it is indicated that the file characteristics of the file to be processed do not belong to the preset virus library, and feature analysis needs to be performed on the file to be processed. Therefore, the step of extracting the character information of the file to be processed to obtain the structural character information of the file to be processed can be performed.

It can be understood that when the matching result is that the matching is successful, the file characteristics of the file to be processed belong to a preset virus library, and the characteristic analysis of the file to be processed is not needed. Thus, the file classification can be exited directly.

In the second embodiment, the feature extraction is performed on the file to be processed to obtain the file feature of the file to be processed, the file feature is matched with the sample feature in the preset virus library to obtain a matching result, and when the matching result is that the matching is failed, the character information extraction is performed on the file to be processed to obtain the structural character information of the file to be processed, so that repeated analysis on the file to be processed can be effectively avoided, and the processing efficiency is improved.

In a second embodiment, the step S30 includes:

Step S301: and acquiring the entry data of the file to be processed, and generating a primary index value according to the entry data and the structural character information.

It may be appreciated that generating the primary index value according to the entry data and the structural character information may be processing the entry data and the structural character information with a preset primary index value generation script, where the preset primary index value generation script may be a script preset by a user for processing entry data and structural character information, which is not limited in this embodiment.

Further, in order to generate the primary index value quickly and accurately and thereby improve processing efficiency, the obtaining of the entry data of the file to be processed and the generating of the primary index value according to the entry data and the structural character information includes:

Step S302: generating a secondary index value of the file to be processed according to the file information.

It may be understood that generating the secondary index value of the file to be processed according to the file information may be processing the file information with a preset file information processing script to obtain the secondary index value, where the preset file information processing script may be a file information processing script preset by a user, which is not limited in this embodiment.

Further, in order to generate the secondary index value quickly and accurately and thereby improve processing efficiency, the generating of the secondary index value of the file to be processed according to the file information includes:

Step S303: generating a file index value according to the primary index value and the secondary index value, and classifying the files to be processed according to the file index value.

It may be understood that generating a file index value according to the primary index value and the secondary index value, and classifying the to-be-processed file according to the file index value may be generating a file index value according to the primary index value and the secondary index value, traversing the to-be-processed file, taking the traversed to-be-processed file as a current file, taking the to-be-processed files except the current file as to-be-matched files, matching the file index value of the current file with the file index value of the to-be-matched file, obtaining a matching result, and classifying the to-be-processed file according to the matching result after traversing the to-be-processed file.

In the second embodiment, the primary index value is generated according to the entry data and the structural character information, the secondary index value of the file to be processed is generated according to the file information, the file index value is generated according to the primary index value and the secondary index value, and the file to be processed is classified according to the file index value, so that the file classification efficiency can be improved.

Referring to fig. 4, fig. 4 is a schematic flow chart of a third embodiment of the file classification method according to the present invention, and based on the second embodiment shown in fig. 3, the third embodiment of the file classification method according to the present invention is proposed.

In a third embodiment, the step S301 includes:

Step S3011: and determining signature characters, sign characters and file attribute information according to the structure character information.

It should be noted that the signature character may be a PE signature character, and the flag character may be a Magic character; the file attribute information may include basic information, debug information, resource information, relocation table information, export table information, version information, program execution entry information, and section table information, which is not limited in this embodiment.

It may be understood that determining the signature character, the flag character, and the file attribute information according to the structure character information may be performing information extraction on the structure character information according to a preset extraction policy to obtain the signature character, the flag character, and the file attribute information, where the preset extraction policy may be an information extraction policy preset by a user, and this embodiment is not limited thereto.

Further, in order to ensure that the signature character has higher accuracy and reliability, the step S3011 includes:

It should be noted that the header character position information may be DOS header position information, which is not limited in this embodiment.

It should be understood that determining the header character position information from the structural character information may be determining the character start position from the structural character information and determining the header character position information from the character start position.

It will be appreciated that the DOS header character is followed by the NT header character, and the NT header character includes the PE signature character. Therefore, the signature character position information can be determined directly from the header character position information, and character extraction is performed according to the signature character position information to obtain the signature character.
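As an illustration of locating the signature character from the DOS header, the following sketch relies on the PE format convention that the 4-byte little-endian field at offset 0x3C of the DOS header (e_lfanew) holds the file offset of the NT header, which begins with the PE signature. The function name is hypothetical, and the embodiment does not prescribe this particular code.

```python
import struct

E_LFANEW_OFFSET = 0x3C  # offset of the e_lfanew field inside the DOS header


def extract_pe_signature(data: bytes) -> bytes:
    """Return the 4 signature bytes found at the start of the NT header."""
    if len(data) < E_LFANEW_OFFSET + 4 or data[:2] != b"MZ":
        raise ValueError("missing or truncated DOS header")
    # e_lfanew gives the signature position relative to the file start.
    (nt_offset,) = struct.unpack_from("<I", data, E_LFANEW_OFFSET)
    signature = data[nt_offset:nt_offset + 4]
    if len(signature) != 4:
        raise ValueError("truncated NT header")
    return signature
```

For a well-formed PE file the returned bytes are "PE" followed by two zero bytes.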

Step S3012: and determining the attribute data index value of the file to be processed according to the signature character, the flag character and the file attribute information.

Further, in order to improve accuracy of the attribute data index value, the step S3012 includes:

It should be noted that the legal file may be a legal PE file; the file bit number information may indicate a 32-bit file or a 64-bit file, which is not limited in this embodiment.

It should be understood that determining whether the file to be processed is a legal file according to the signature character, to obtain the file judgment result, may be matching the signature character against the standard signatures in a preset legal signature table to obtain a signature matching result, and determining that the file to be processed is a legal file when the signature matching result is a matching success. The preset legal signature table may hold signature information preset by a user, for example the MZ signature information and the PE signature information of a PE file, which is not limited in this embodiment.

In a specific implementation, for example, the bit number of the file can be determined from the flag character, namely the Magic field of IMAGE_OPTIONAL_HEADER. When the Magic field = 0x10b, the file is a 32-bit file; when the Magic field = 0x20b, the file is a 64-bit file.
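The Magic check can be sketched as follows, assuming the standard PE layout in which the optional header follows the 4-byte PE signature and the 20-byte IMAGE_FILE_HEADER. The function name and the lookup table are illustrative assumptions rather than part of the embodiment.

```python
import struct

# Magic values named in the text, mapped to the file bit number.
MAGIC_TO_BITS = {0x10B: 32, 0x20B: 64}


def file_bits(data: bytes) -> int:
    """Read the Magic field of the optional header and return 32 or 64."""
    (nt_offset,) = struct.unpack_from("<I", data, 0x3C)
    # Skip the PE signature (4 bytes) and IMAGE_FILE_HEADER (20 bytes).
    optional_header_offset = nt_offset + 4 + 20
    (magic,) = struct.unpack_from("<H", data, optional_header_offset)
    if magic not in MAGIC_TO_BITS:
        raise ValueError(f"unexpected Magic value {magic:#x}")
    return MAGIC_TO_BITS[magic]
```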

Further, in practical applications, if the attribute data index value of the file to be processed were determined directly from the file judgment result, the file bit number information and the file attribute information, the judgment process would involve too many objects and too much computation. To overcome this drawback, the determining the attribute data index value of the to-be-processed file according to the file judgment result, the file bit number information and the file attribute information includes:

It should be noted that the basic information may be the value of the Characteristics field of IMAGE_FILE_HEADER;

the debug information may be the IMAGE_DIRECTORY_ENTRY_DEBUG data directory;

the section table information may be the information in IMAGE_SECTION_HEADER.

It may be understood that performing information fusion on the platform information, the debug information, the resource information, the relocation table information and the export table information to obtain the fusion information may be fusing these items based on a preset information fusion policy, where the preset information fusion policy may be an information fusion policy preset by a user, which is not limited in this embodiment.
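One possible fusion policy, shown purely as an assumption, packs the five items into a small bit field with one bit per item. The bit order, the function name and the reduction of each item to a yes/no flag are all hypothetical; the embodiment leaves the policy to the user.

```python
def fuse_information(is_64bit: bool, has_debug: bool, has_resources: bool,
                     has_relocations: bool, has_exports: bool) -> int:
    """Pack five presence flags into one integer, lowest bit first."""
    flags = (is_64bit, has_debug, has_resources, has_relocations, has_exports)
    fused = 0
    for bit, present in enumerate(flags):
        if present:
            fused |= 1 << bit
    return fused
```

Under this hypothetical layout, a 32-bit file with debug and resource information but no relocation table and no export table yields the value 0b00110.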

Further, considering that in practical applications, if the file attribute index value is determined directly according to the base information, the fusion information, the version information, the program execution entry information and the section table information, the accuracy of the file attribute index value is likely to be low. To overcome this drawback, the determining a file attribute index value according to the base information, the fusion information, the version information, the program execution entry information, and the section table information includes:

Determining a version index value according to the version information;

It should be understood that, classifying the file to be processed according to the basic information, and determining the basic information index value according to the classification result may be determining a basic value according to the basic information, classifying the file to be processed according to the basic value, obtaining a file type, determining a numerical conversion rule according to the file type, and converting the basic value according to the numerical conversion rule to obtain the basic information index value.

It may be understood that determining the numerical conversion rule according to the file type may be searching for a numerical conversion rule corresponding to the file type in a preset conversion rule table, where the preset conversion rule table includes a correspondence between the file type and the numerical conversion rule, and the correspondence between the file type and the numerical conversion rule may be preset by a user.
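The classify, look up, then convert sequence can be sketched as below. The sketch assumes the base value is the Characteristics field and uses the PE flag IMAGE_FILE_DLL (0x2000) to tell DLL from EXE; the two conversion rules in the table are invented placeholders for whatever rules a user would preset.

```python
IMAGE_FILE_DLL = 0x2000  # Characteristics flag defined by the PE format


def classify_file_type(base_value: int) -> str:
    """Classify the file from its base value (assumed to be Characteristics)."""
    return "DLL" if base_value & IMAGE_FILE_DLL else "EXE"


# Hypothetical preset conversion rule table: file type -> conversion rule.
CONVERSION_RULE_TABLE = {
    "EXE": lambda base_value: base_value & 0xFF,
    "DLL": lambda base_value: (base_value >> 8) & 0xFF,
}


def base_information_index(base_value: int) -> int:
    """Look up the rule for the file type and convert the base value."""
    file_type = classify_file_type(base_value)
    rule = CONVERSION_RULE_TABLE[file_type]
    return rule(base_value)
```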

It should be understood that determining the version index value according to the version information may be performing information filtering on the version information to obtain major version number information and minor version number information, and generating the version index value according to the major version number information and the minor version number information.

It may be understood that determining the program entry index value according to the program entry category may be searching for a program entry index value corresponding to the program entry category in a preset entry index table, where the preset entry index table includes a correspondence between the program entry category and the program entry index value, and the correspondence between the program entry category and the program entry index value may be preset by a user.

It should be understood that determining the file attribute index value according to the base information index value, the fusion information index value, the version index value, the program entry index value, and the section table index value may be determining the file attribute index value according to a preset file attribute index value generation policy according to the base information index value, the fusion information index value, the version index value, the program entry index value, and the section table index value, where the preset file attribute index value generation policy may be an index value synthesis policy preset by a user, and this embodiment is not limited thereto.

Further, to ensure that the generated version index value has higher accuracy and reliability, the determining the version index value according to the version information includes:

It may be understood that the generation of the version index value according to the primary version number information and the secondary version number information may be to search for a primary version number information weight value corresponding to the primary version number information, search for a secondary version number information weight value corresponding to the secondary version number information, and generate the version index value according to the primary version number information, the primary version number information weight value, the secondary version number information, and the secondary version number information weight value.
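A weighted combination of this kind might look like the following sketch. The weight tables, their default values and the weighted-sum formula are assumptions chosen for illustration; the embodiment only states that a weight is looked up for each version number.

```python
# Hypothetical per-version weight tables; empty means "use the default".
MAJOR_WEIGHTS: dict[int, int] = {}
MINOR_WEIGHTS: dict[int, int] = {}
DEFAULT_MAJOR_WEIGHT = 0x100
DEFAULT_MINOR_WEIGHT = 0x1


def version_index(major: int, minor: int) -> int:
    """Combine major and minor version numbers using looked-up weights."""
    major_weight = MAJOR_WEIGHTS.get(major, DEFAULT_MAJOR_WEIGHT)
    minor_weight = MINOR_WEIGHTS.get(minor, DEFAULT_MINOR_WEIGHT)
    return major * major_weight + minor * minor_weight
```

With these assumed defaults, linker version 7.10 maps to 7 * 0x100 + 10 = 0x070A.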

Step S3013: and acquiring the entry data of the file to be processed, and searching an entry data index value corresponding to the entry data.

It may be understood that searching for the entry data index value corresponding to the entry data may be searching for the entry data index value corresponding to the entry data in a preset entry data table, where the preset entry data table includes a correspondence between the entry data and the entry data index value, and the correspondence between the entry data and the entry data index value may be preset by a user, which is not limited in this embodiment.
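The table lookup can be sketched as a dictionary keyed by the entry bytes. The entry bytes 6a 70 68 98 are the ones given for Notepad.exe later in this description; the index values stored in the table and the fallback value are hypothetical.

```python
# Hypothetical preset entry data table: entry bytes -> entry data index value.
ENTRY_DATA_TABLE = {
    bytes([0x6A, 0x70, 0x68, 0x98]): 0x01,
}
UNKNOWN_ENTRY_INDEX = 0x00  # assumed fallback for entry data not in the table


def entry_data_index(entry_bytes: bytes) -> int:
    """Look up the index value for the first four bytes at the entry point."""
    key = bytes(entry_bytes[:4])
    return ENTRY_DATA_TABLE.get(key, UNKNOWN_ENTRY_INDEX)
```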

Step S3014: generating a primary index value according to the attribute data index value and the entry data index value.

It should be understood that generating the primary index value according to the attribute data index value and the entry data index value may be determining the primary index value according to the attribute data index value and the entry data index value by a preset primary index value generation policy, where the preset primary index value generation policy may be an index value processing policy preset by a user, which is not limited in this embodiment.

In a third embodiment, signature characters, flag characters and file attribute information are determined according to the structure character information, attribute data index values of the files to be processed are determined according to the signature characters, the flag characters and the file attribute information, entry data of the files to be processed are obtained, entry data index values corresponding to the entry data are searched, and a primary index value is generated according to the attribute data index values and the entry data index values, so that a primary index value can be quickly and accurately generated, and processing efficiency is improved.

In a third embodiment, the step S302 includes:

Step S3021: and extracting information from the file information to obtain file import table information and export information.

It should be noted that the file import table information may be the number of import dynamic libraries, etc.; the export information may be the number of export functions, etc., which is not limited in this embodiment.

Step S3022: and determining a file import table index value according to the file import table information.

It should be understood that determining the file import table index value according to the file import table information may be searching for the file import table index value corresponding to the file import table information.

Step S3023: determining a data classification index value according to the export information and the resource information.

Further, the step S3023 includes:

It can be understood that when the function judgment result is that the file to be processed does not include an export function and the resource judgment result is that the file to be processed includes resource data, the data classification index value is obtained from the types and the number of the resources. For example, the resulting data classification index value may be 0x5D73.

Step S3024: and generating a secondary index value of the file to be processed according to the data classification index value and the resource judgment index value.

It should be understood that the generation of the secondary index value of the file to be processed according to the data classification index value and the resource judgment index value may be the generation of the secondary index value of the file to be processed according to the data classification index value and the resource judgment index value by a preset secondary index value generation policy, where the preset secondary index value generation policy may be an index value processing policy preset by a user, which is not limited in this embodiment.

In a third embodiment, the file information is extracted to obtain file import table information and export information, a file import table index value is determined according to the file import table information, a data classification index value is determined according to the export information and the resource information, and a secondary index value of the file to be processed is generated according to the data classification index value and the resource judgment index value, so that an accurate and reliable secondary index value can be generated.

In a third embodiment, the step S303 includes:

Step S3031: and generating a file index value according to the primary index value and the secondary index value.

It may be understood that generating the file index value according to the primary index value and the secondary index value may be generating the file index value according to a preset file index generation policy according to the primary index value and the secondary index value, where the preset file index generation policy may be an index value processing policy preset by a user.

In a specific implementation, for example, the structural character information and the file information of Notepad.exe (Windows XP SP2, MD5: C9F225F98574759E377BCE6D87958C9C) are: number of sections: 3; number of executable sections: 1; file type: executable file (EXE); CPU architecture: 32-bit; entry bytes: 6a, 70, 68, 98; entry located in section 0; number of export functions: 0; number of imported dynamic libraries: 9; debug information: present; resource information: present; relocation table: absent; linker version: 7.10.

According to the above steps, the primary index value is 0x0B02F401, the secondary index value is 0x5D73B249, and the file index value is 0B02F4015D73B249.
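The example values are consistent with placing the 32-bit primary index value in the high half and the 32-bit secondary index value in the low half of a 64-bit file index value. The sketch below assumes that concatenation policy; the function name is hypothetical, and other generation policies are equally permitted by the embodiment.

```python
def file_index(primary: int, secondary: int) -> str:
    """Concatenate two 32-bit index values into a 16-digit hexadecimal string."""
    for value in (primary, secondary):
        if not 0 <= value <= 0xFFFFFFFF:
            raise ValueError("index values must fit in 32 bits")
    # Primary occupies the high 32 bits, secondary the low 32 bits.
    return f"{(primary << 32) | secondary:016X}"
```

Calling file_index(0x0B02F401, 0x5D73B249) reproduces the file index value 0B02F4015D73B249 given above.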

Step S3032: traversing the file to be processed, and taking the traversed file to be processed as a current file.

It may be understood that the traversing of the file to be processed may be performed according to a preset sequence, or may be performed according to an uploading time of the file to be processed, where the preset sequence may be a processing sequence preset by a user, which is not limited in this embodiment.

Step S3033: and taking the files to be processed except the current file as files to be matched, and matching the file index value of the current file with the file index value of the files to be matched to obtain a matching result.

It can be understood that when the file index value of the current file is the same as the file index value of the file to be matched, the current file is judged to be successfully matched with the file to be matched.

Step S3034: and after the traversal of the files to be processed is finished, classifying the files to be processed according to the matching result.

It should be understood that once the traversal of the files to be processed is finished, every file to be processed has been matched against the others; at this point, each file to be processed and the files successfully matched with it may be classified as similar files.
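Because two files match exactly when their file index values are equal, the pairwise traversal can be replaced, as an implementation shortcut that is an assumption here rather than the described procedure, by grouping files under their index value. The function name and the sample file names are hypothetical.

```python
from collections import defaultdict


def classify_by_index(file_indexes: dict[str, str]) -> list[list[str]]:
    """Group file names whose file index values are identical."""
    groups: defaultdict[str, list[str]] = defaultdict(list)
    for name, index_value in file_indexes.items():
        groups[index_value].append(name)
    # Each group is one class of similar files, in first-seen order.
    return list(groups.values())
```

Grouping visits each file once, whereas matching every file against every other file grows quadratically with the number of files; the resulting classes are the same.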

In a third embodiment, a file index value is generated according to the primary index value and the secondary index value, the file to be processed is traversed, the traversed file to be processed is used as a current file, files to be processed except the current file are used as files to be matched, the file index value of the current file is matched with the file index value of the file to be matched, a matching result is obtained, and after the traversing of the file to be processed is finished, the file to be processed is classified according to the matching result, so that the classification rate of the file to be processed can be improved, and user experience is improved.

In addition, the embodiment of the invention also provides a storage medium, wherein the storage medium stores a file classification program, and the file classification program realizes the steps of the file classification method when being executed by a processor.

In addition, referring to fig. 5, an embodiment of the present invention further provides a file classification device, where the file classification device includes: an acquisition module 10, an extraction module 20 and a classification module 30;

the acquiring module 10 is configured to acquire a file to be processed and file information of the file to be processed.

The extracting module 20 is configured to extract character information of the file to be processed, and obtain structural character information of the file to be processed.

The classification module 30 is configured to determine a file index value according to the structural character information and the file information, and classify the file to be processed according to the file index value.

In this embodiment, a file to be processed and file information of the file to be processed are obtained, character information extraction is performed on the file to be processed, structural character information of the file to be processed is obtained, a file index value is determined according to the structural character information and the file information, and the file to be processed is classified according to the file index value; compared with the existing manual analysis of code features of sample files, in the method for classifying the sample files, in the embodiment, the file index value is determined through the file information and the structure character information of the files to be processed, and the files to be processed are classified according to the file index value, so that the defects of low file classification efficiency and poor reliability in the prior art are overcome, the file classification process can be optimized, the file classification efficiency is improved, and the file classification reliability is guaranteed.

Other embodiments or specific implementation manners of the file classification apparatus according to the present invention may refer to the above method embodiments, and are not described herein again.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.

The foregoing embodiment numbers of the present invention are merely for the purpose of description and do not represent the advantages or disadvantages of the embodiments. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The terms first, second, third, etc. do not denote any order; they are to be interpreted as names.

From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by means of hardware, although in many cases the former is the preferred implementation. Based on such understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. read-only memory (ROM)/random access memory (RAM), magnetic disk, optical disk), comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method according to the embodiments of the present invention.

The foregoing description covers only the preferred embodiments of the present invention and is not intended to limit the scope of the invention; any equivalent structure or equivalent process derived from the contents disclosed herein, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the present invention.

Claims

1. A method of classifying documents, the method comprising the steps of:

determining a file index value according to the structural character information and the file information, and classifying the files to be processed according to the file index value;

Before the step of extracting the character information of the file to be processed to obtain the structural character information of the file to be processed, the file classification method further comprises the following steps:

2. The method for classifying files according to claim 1, wherein said step of determining a file index value based on said structural character information and said file information and classifying said files to be processed based on said file index value comprises:

3. The method for classifying files according to claim 2, wherein the step of obtaining the entry data of the files to be processed and generating the primary index value according to the entry data and the structural character information comprises the following steps:

4. The method of classifying documents as claimed in claim 3, wherein said step of determining signature character, logo character and document attribute information from said structural character information comprises:

5. The method for classifying files according to claim 4, wherein said step of determining the attribute data index value of said file to be processed based on said signature character, said flag character and said file attribute information comprises:

6. The method of classifying files according to claim 5, wherein said step of determining an index value of attribute data of said file to be processed based on said file judgment result, said file digit information, and said file attribute information comprises:

7. The method of classifying files according to claim 6, wherein said step of determining file attribute index values based on said base information, said fusion information, said version information, said program execution entry information, and said section table information comprises:

Determining a version index value according to the version information;

8. The method for classifying files according to claim 7, wherein said step of determining a version index value based on said version information comprises:

9. The method for classifying files according to claim 8, wherein said step of generating a secondary index value of said file to be processed based on said file information comprises:

and generating a secondary index value of the file to be processed according to the data classification index value and the file import table index value.

10. The method of classifying files according to claim 9, wherein said step of determining a data classification index value based on said export information and said resource information comprises:

11. The method for classifying files according to claim 2, wherein the step of generating a file index value according to the primary index value and the secondary index value and classifying the files to be processed according to the file index value specifically comprises:

12. A document sorting apparatus, characterized in that the document sorting apparatus includes: memory, a processor and a file classification program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the file classification method according to any of claims 1 to 11.

13. A storage medium having stored thereon a file sort program which, when executed by a processor, implements the steps of the file sort method of any one of claims 1 to 11.

14. A document sorting apparatus, characterized in that the document sorting apparatus includes: the device comprises an acquisition module, an extraction module and a classification module;

The classification module is used for determining a file index value according to the structural character information and the file information and classifying the file to be processed according to the file index value;

wherein, the file classification device further includes: a matching module;

The matching module is used for extracting the characteristics of the file to be processed to obtain the file characteristics of the file to be processed; matching the file characteristics with sample characteristics in a preset virus library to obtain a matching result;

Correspondingly, the extraction module is further configured to extract character information of the to-be-processed file when the matching result is that the matching fails, so as to obtain structural character information of the to-be-processed file.

15. The document classification apparatus of claim 14, wherein the classification module is further configured to obtain entry data of the document to be processed, and generate a primary index value according to the entry data and the structural character information;

16. The document classification apparatus of claim 15, wherein the classification module is further configured to determine signature characters, logo characters, and document attribute information based on the structural character information;

17. The document classification apparatus of claim 16, wherein the classification module is further configured to determine header character position information, flag characters, document attribute information based on the structural character information;

18. The document classification apparatus of claim 17, wherein the classification module is further configured to determine whether the document to be processed is a legal document according to the signature character, and obtain a document determination result;

19. The file classification apparatus of claim 18, wherein the classification module is further configured to perform information extraction on the attribute information to obtain base information, platform information, debug information, resource information, relocation table information, export table information, version information, program execution entry information, and section table information;