CN110825701A - File type determination method and device, electronic equipment and readable storage medium - Google Patents

File type determination method and device, electronic equipment and readable storage medium

Info

Publication number
CN110825701A
Authority
CN
China
Prior art keywords
file
file type
classified
identification
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911083245.4A
Other languages
Chinese (zh)
Inventor
蔡家坡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd filed Critical Sangfor Technologies Co Ltd
Priority to CN201911083245.4A priority Critical patent/CN110825701A/en
Publication of CN110825701A publication Critical patent/CN110825701A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a file type determining method. Unlike the prior art, which determines the file type only from the superficial and easily modified suffix, the method identifies the file type from deeper within the file to be classified by means of position-fixed features and/or position-non-fixed features. Because these features are generated according to preset standards by files of the corresponding file types, they are extremely difficult to modify deliberately. The accuracy of a file type determined from these features is therefore well guaranteed, and intrusion by malicious files that disguise their file type can be effectively prevented. The application also discloses a file type determining device, an electronic device, and a readable storage medium, which have the same beneficial effects.

Description

File type determination method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of network security technologies, and in particular, to a method and an apparatus for determining a file type, an electronic device, and a readable storage medium.
Background
The file type is an important way of classifying files under the existing data file mechanism. Common examples are application files with the suffix EXE, image files with the suffix JPG, and presentation files with the suffix PPT.
In most current file security detection mechanisms, a file detection rule set based on file type is also an important part. For such a rule set, what matters most is that the detected file type is the true file type of the file to be classified, that is, the accuracy of the detection result. A file type scheme based on the file suffix, however, is easy to bypass: by deliberately changing the suffix to that of another type, removing the suffix, or the like, a file can evade the detection mechanism and successfully invade the target network.
Therefore, how to provide a scheme capable of accurately detecting the file type of the file to be classified is an urgent problem to be solved by those skilled in the art.
Disclosure of Invention
The application aims to provide a file type determining method, a file type determining device, electronic equipment and a readable storage medium, and aims to improve the detection accuracy of file types of files to be classified.
To achieve the above object, the present application provides a file type determining method, including:
receiving an incoming file to be classified;
executing file type identification operation on the file to be classified by utilizing a preset first identification library and/or a preset second identification library to correspondingly obtain a first identification result and/or a second identification result; the first identification library records a first corresponding relation between each file type and each position fixed feature, and the second identification library records a second corresponding relation between each file type and each position non-fixed feature;
and determining the file type of the file to be classified according to the first recognition result and/or the second recognition result.
Optionally, performing the file type identification operation on the file to be classified by using the preset first identification library and/or the preset second identification library to correspondingly obtain the first identification result and/or the second identification result, and determining the file type of the file to be classified according to the first identification result and/or the second identification result, includes:
executing the file type identification operation on the file to be classified by utilizing the first identification library to obtain a first identification result;
when the file type of the file to be classified cannot be determined uniquely according to the first identification result, executing the file type identification operation on the file to be classified by using the second identification library to obtain a second identification result;
and determining the file type of the file to be classified by using the first recognition result and the second recognition result.
Optionally, the performing, by using the first recognition library, the file type recognition operation on the file to be classified to obtain the first recognition result includes:
performing a feature matching operation on the file to be classified by using each position-fixed feature contained in the first identification library to obtain matched position-fixed features;
determining, according to the first corresponding relation, the file type corresponding to each matched position-fixed feature as a first matching file type;
correspondingly, the case in which the file type of the file to be classified cannot be uniquely determined according to the first identification result includes:
the number of first matching file types being greater than 1;
correspondingly, performing the file type identification operation on the file to be classified by using the second identification library to obtain the second identification result includes:
performing a feature matching operation on the file to be classified by using each position-non-fixed feature contained in the second identification library to obtain matched position-non-fixed features;
determining, according to the second corresponding relation, the file type corresponding to each matched position-non-fixed feature as a second matching file type;
correspondingly, determining the file type of the file to be classified according to the first identification result and the second identification result includes:
determining the file type of the file to be classified according to the first matching file type and the second matching file type.
Optionally, determining the file type of the file to be classified according to the first matching file type and the second matching file type includes:
calculating a first probability for each first matching file type according to the number and/or length of its matched position-fixed features, and sorting by probability to obtain a first sorting table;
calculating a second probability for each second matching file type according to the number and/or length of its matched position-non-fixed features, and sorting by probability to obtain a second sorting table;
accumulating the first probability and the second probability of the same file type in the first sorting table and the second sorting table to obtain a processed sorting table;
and taking each matching file type in the processed sorting table as a possible file type of the file to be classified.
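The accumulation step can be pictured with a minimal Python sketch. The function name `merge_rankings`, the table layout (lists of `(file_type, probability)` pairs) and the sample probabilities are illustrative assumptions; the application does not specify how the probabilities themselves are computed.

```python
from collections import defaultdict

def merge_rankings(first_table, second_table):
    """Accumulate the per-type probabilities of both sorting tables.

    Each table is assumed to be a list of (file_type, probability) pairs.
    """
    totals = defaultdict(float)
    for file_type, probability in first_table:
        totals[file_type] += probability
    for file_type, probability in second_table:
        totals[file_type] += probability
    # Highest accumulated value first; ties are ordered by type name
    # so that the result is deterministic.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

# Illustrative values only.
first_table = [("xls", 0.5), ("wps", 0.5)]
second_table = [("xls", 0.75)]
merged = merge_rankings(first_table, second_table)
print(merged)
```

Every type in the merged table remains a possible file type; the ordering only expresses which candidate the combined evidence favours.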
Optionally, calculating the first probability according to the number and/or length of the matched position-fixed features of each first matching file type, and sorting by probability to obtain the first sorting table, includes:
performing a weighted calculation for each first matching file type according to the number and/or length of its matched position-fixed features to obtain a weighted score for each type, wherein the number corresponds to a preset first weight and the length corresponds to a preset second weight;
and sorting the weighted scores by size to obtain the first sorting table.
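As a rough illustration of the weighted calculation, the sketch below scores each candidate type as `count_weight * number + length_weight * total length`. The linear form, the default weight values, the function name `weighted_scores` and the "wav" entry are all assumptions made for the example; the application only states that the number and the length each have a preset weight.

```python
def weighted_scores(matches, count_weight=1.0, length_weight=0.5):
    """Build a sorting table from matched features.

    matches: dict mapping a file type to the list of matched feature
    byte strings. The weights are hypothetical preset values.
    """
    scores = {}
    for file_type, features in matches.items():
        count = len(features)
        total_length = sum(len(feature) for feature in features)
        scores[file_type] = count_weight * count + length_weight * total_length
    # Largest score first; ties are ordered by type name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical match result: "RIFF" alone is shared by several container
# formats, while "AVI LIST" is additional evidence for AVI.
matches = {
    "avi": [b"RIFF", b"AVI LIST"],
    "wav": [b"RIFF"],
}
table = weighted_scores(matches)
print(table)
```

With these weights, a type supported by more and longer matched features ranks higher, which is the intent of weighting both number and length.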
Optionally, when the file type of the file to be classified cannot be determined according to the first recognition result and/or the second recognition result, the method further includes:
executing text file feature recognition operation on the file to be classified to obtain a third recognition result;
and determining whether the file to be classified is a plain text file or not according to the third recognition result.
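The application does not state how the text file feature recognition is carried out. One possible heuristic, shown here purely as an assumption, is to treat the data as plain text when it decodes as UTF-8 and contains no control characters other than common whitespace. The function name `looks_like_plain_text` is hypothetical.

```python
def looks_like_plain_text(data: bytes) -> bool:
    """Heuristic plain-text check (an assumed approach, not the patent's)."""
    if not data:
        # Assumed choice: an empty file is left undetermined.
        return False
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    allowed_controls = {"\t", "\n", "\r"}
    # Every character must be printable or ordinary whitespace.
    return all(ch.isprintable() or ch in allowed_controls for ch in text)

print(looks_like_plain_text(b"hello\nworld\n"))
print(looks_like_plain_text(b"\xff\xd8\xff\xe0\x00\x10\x4a\x46"))
```

A real detector would probably also consider other encodings and sample only part of a large file; those refinements are left out of the sketch.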
Optionally, the file type determining method further includes:
marking the file type determined according to the first recognition result and/or the second recognition result as a first type;
judging whether a second type is consistent with the first type, the second type being the file type determined according to the suffix of the file to be classified;
if not, attaching an abnormal mark to the file to be classified, and executing a malicious content detection operation on the file to be classified to which the abnormal mark is attached.
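The consistency check between the feature-based type and the suffix-based type might look like the following sketch. The helper names `suffix_type` and `check_consistency` and the dictionary layout are assumptions; the malicious content detection itself is outside the sketch and is only represented by the `abnormal` flag.

```python
import os

def suffix_type(filename: str) -> str:
    """Second type: the file type implied by the suffix (lower-cased)."""
    return os.path.splitext(filename)[1].lstrip(".").lower()

def check_consistency(filename: str, first_type: str) -> dict:
    """Compare the feature-based first type with the suffix-based second type."""
    second_type = suffix_type(filename)
    return {
        "file": filename,
        "first_type": first_type,
        "second_type": second_type,
        # An abnormal mark means the file should go on to malicious
        # content detection (not implemented in this sketch).
        "abnormal": second_type != first_type,
    }

print(check_consistency("report.jpg", "exe"))
print(check_consistency("photo.JPG", "jpg"))
```

A practical version would also need to treat alias suffixes (for example "jpg" and "jpeg") as the same type before comparing.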
To achieve the above object, the present application also provides a file type determining apparatus, including:
the file receiving unit to be classified is used for receiving the incoming file to be classified;
the file type identification unit is used for executing file type identification operation on the file to be classified by utilizing a preset first identification library and/or a preset second identification library to correspondingly obtain a first identification result and/or a second identification result; the first identification library records a first corresponding relation between each file type and each position fixed feature, and the second identification library records a second corresponding relation between each file type and each position non-fixed feature;
and the file type determining unit is used for determining the file type of the file to be classified according to the first recognition result and/or the second recognition result.
Optionally, the file type identifying unit includes:
the first identification subunit is configured to execute the file type identification operation on the file to be classified by using the first identification library to obtain the first identification result;
the second identification subunit is configured to, when the file type of the file to be classified cannot be uniquely determined according to the first identification result, perform the file type identification operation on the file to be classified by using the second identification library to obtain a second identification result;
and the file type common identification subunit is used for determining the file type of the file to be classified by using the first identification result and the second identification result.
Optionally, the first identification subunit includes:
the position fixed feature matching module is used for performing feature matching operation on the file to be classified by using each position fixed feature contained in the first identification library to obtain a matched position fixed feature;
the first matching file type determining module is used for determining the file type corresponding to the matching position fixed feature as a first matching file type according to the first corresponding relation;
correspondingly, the second identification subunit comprises:
the position non-fixed feature matching module is used for performing feature matching operation on the file to be classified by using each position non-fixed feature contained in the second identification library when the number of the first matching file types is greater than 1 to obtain a matching position non-fixed feature;
the second matching file type determining module is used for determining the file type corresponding to the non-fixed characteristic of the matching position as a second matching file type according to the second corresponding relation;
correspondingly, the file type common identification subunit comprises:
and the file type common determination module is used for determining the file type of the file to be classified according to the first matching file type and the second matching file type.
Optionally, the file type common determination module includes:
the first sequencing submodule is used for calculating to obtain a first probability according to the number and/or the length of the fixed features of the matching positions of each first matching file type, and sequencing according to the probability to obtain a first sequencing table;
the second sorting submodule is used for calculating to obtain a second probability according to the number and/or the length of the non-fixed features of the matching position of each second matching file type, and sorting according to the probability to obtain a second sorting table;
the accumulation processing submodule is used for accumulating the first probability and the second probability of the same file type in the first sorting table and the second sorting table to obtain a processed sorting table;
and the possible file type determining submodule is used for taking each matched file type in the processed sorting table as the possible file type of the file to be classified.
Optionally, the first ordering sub-module includes:
the weighting calculation component is used for carrying out weighting calculation on each first matching file type according to the number and/or length of the corresponding matching position fixed features by using a weighting calculation method to obtain each weighting score; wherein, the number corresponds to a preset first weight value, and the length corresponds to a preset second weight value;
and the score sorting component is used for sorting the weighted scores according to sizes to obtain the first sorting table.
Optionally, the file type determining apparatus further includes:
the text file feature recognition unit is used for executing text file feature recognition operation on the file to be classified to obtain a third recognition result when the file type of the file to be classified cannot be determined according to the first recognition result and/or the second recognition result;
and the plain text file determining unit is used for determining whether the file to be classified is a plain text file according to the third identification result.
Optionally, the file type determining apparatus further includes:
the first type determining unit is used for marking the file type determined according to the first recognition result and/or the second recognition result as a first type;
the type consistency judging unit is used for judging whether a second type is consistent with the first type; the second type is a file type determined according to a suffix of the file to be classified;
and the abnormal mark attaching and malicious content detecting unit is used for attaching an abnormal mark to the file to be classified and executing malicious content detecting operation on the file to be classified attached with the abnormal mark when the second type is inconsistent with the first type.
To achieve the above object, the present application also provides an electronic device, including:
a memory for storing a computer program;
a processor for implementing the file type determination method as described above when executing the computer program.
To achieve the above object, the present application also provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the file type determination method as described above.
The file type determining method provided by the application comprises the following steps: receiving an incoming file to be classified; executing file type identification operation on the file to be classified by utilizing a preset first identification library and/or a preset second identification library to correspondingly obtain a first identification result and/or a second identification result; the first identification library records a first corresponding relation between each file type and each position fixed feature, and the second identification library records a second corresponding relation between each file type and each position non-fixed feature; and determining the file type of the file to be classified according to the first recognition result and/or the second recognition result.
According to the file type determining method provided by the application, unlike the prior art, which determines the file type only from the superficial and easily modified suffix, the file type is identified from deeper within the file to be classified by means of position-fixed features and/or position-non-fixed features. These features are generated according to preset standards by files of the corresponding file types and are extremely difficult to modify deliberately, so the accuracy of the file type determined from them is well guaranteed, and intrusion by malicious files that disguise their file type can be effectively prevented. The application also provides a file type determining device, an electronic device, and a readable storage medium, which have the same beneficial effects and are not described again here.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a file type determining method according to an embodiment of the present application;
Fig. 2 is an actual code diagram for identifying the position-fixed features of the xls file type according to an embodiment of the present application;
Fig. 3 is an actual code diagram for identifying the position-fixed features of the WPS file type according to an embodiment of the present application;
Fig. 4 is an actual code diagram for identifying the position-fixed features of the JPG file type according to an embodiment of the present application;
Fig. 5 is an actual code diagram for identifying the position-fixed features of the AVI file type according to an embodiment of the present application;
Fig. 6 is a flowchart of a method for identifying the file type of a file to be classified serially, first by position-fixed features and then by position-non-fixed features, according to an embodiment of the present application;
Fig. 7 is a flowchart of a specific method for identifying the file type of a file to be classified by position-fixed features and position-non-fixed features according to an embodiment of the present application;
Fig. 8 is a flowchart of a method for determining the file type of a file to be classified according to a first matching file type and a second matching file type according to an embodiment of the present application;
Fig. 9 is a block diagram of the structure of a file type determining apparatus according to an embodiment of the present application;
Fig. 10 is a schematic diagram of a file type determining apparatus in an application scenario according to an embodiment of the present application;
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The application aims to provide a file type determining method, a file type determining device, electronic equipment and a readable storage medium, and aims to improve the detection accuracy of file types of files to be classified, and further better prevent the invasion of malicious contents disguised by the file types.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a flowchart of a file type determining method according to an embodiment of the present application, including the following steps:
S101: receiving an incoming file to be classified;
The file to be classified is a data file whose file type needs to be detected (or determined). Depending on the execution subject of the embodiment, the party that passes in the file to be classified varies accordingly.
For example, when the execution subject is a network security device deployed between an external network and an internal network, the file to be classified is mainly passed in by the data transmission device that transfers files from the external network to the internal network; in this scenario, the subsequent steps mainly serve to prevent intrusion by malicious content. When the execution subject is a functional component of a data storage server in the intranet, the file to be classified is mainly passed in by intranet clients; in this scenario, the subsequent steps mainly serve to guarantee the reliability and security of the data files on the data storage server. Many other similar scenarios exist, and the scheme can be adapted to the actual situation; no specific limitation is made here.
S102: executing file type identification operation on the files to be classified by utilizing a preset first identification library and/or a preset second identification library to correspondingly obtain a first identification result and/or a second identification result;
The first identification library records the correspondence between each file type and each position-fixed feature, referred to here as the first corresponding relation; the second identification library records the correspondence between each file type and each position-non-fixed feature, referred to here as the second corresponding relation.
A position-fixed feature is a feature that appears at a fixed position within the complete data of the file to be classified, that is, a feature that appears at the same position in all files of the same type. Correspondingly, a position-non-fixed feature is a feature that appears at a non-fixed position within the complete data of the file to be classified, that is, the same feature may appear at different positions in different files of the same type.
Taking a file of type xls (a spreadsheet file type) as an example, the position-fixed feature "Root Entry" normally appears at the fixed positions 00000400h and 00000410h, while the position-non-fixed feature "Workbook" appears, in one case, at 00000480h and 00000490h (see fig. 2). Likewise, WPS documents possess the position-fixed features "WordDocument" and "Root Entry" and the position-non-fixed feature "wpsoffice" (see fig. 3); JPG image files possess the position-fixed feature (binary string) "\xff\xd8\xff\xe0\x00\x10\x4a\x46" (see fig. 4); and AVI video files possess the position-fixed features "RIFF" and "AVI LIST" and the binary string "\x00\x00\x68\x64\x72\x6c\x61\x76\x69\x68\x38\x00\x00\x00" (see fig. 5).
The features referred to in this application are mostly specific character strings, that is, strings composed of specific characters. Other types of features are not excluded, however, and a feature may be expressed in any form as long as it appears at a fixed or non-fixed position as defined above.
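To make the two kinds of libraries concrete, the following Python sketch stores position-fixed features as `(offset, bytes)` pairs and position-non-fixed features as byte strings searched anywhere in the data. The dictionary layout, the function names, the offsets 0 and 8, and the plain-ASCII form of the xls string are assumptions for illustration; the application only gives the feature strings themselves, and real files may store some of them in a different encoding.

```python
# First identification library (assumed layout):
# file type -> list of (offset, feature bytes).
FIRST_LIBRARY = {
    "jpg": [(0, b"\xff\xd8\xff\xe0\x00\x10\x4a\x46")],
    "avi": [(0, b"RIFF"), (8, b"AVI LIST")],
}

# Second identification library (assumed layout):
# file type -> list of feature bytes that may appear at any position.
SECOND_LIBRARY = {
    "xls": [b"Workbook"],
}

def match_fixed(data: bytes, library: dict) -> dict:
    """Return the types whose features occur at their recorded offsets."""
    result = {}
    for file_type, features in library.items():
        hits = [feature for offset, feature in features
                if data[offset:offset + len(feature)] == feature]
        if hits:
            result[file_type] = hits
    return result

def match_floating(data: bytes, library: dict) -> dict:
    """Return the types whose features occur anywhere in the data."""
    result = {}
    for file_type, features in library.items():
        hits = [feature for feature in features if feature in data]
        if hits:
            result[file_type] = hits
    return result

# Synthetic sample: a RIFF header with a 4-byte size field, then "AVI LIST".
sample = b"RIFF" + b"\x00" * 4 + b"AVI LIST" + b"hdrl"
print(match_fixed(sample, FIRST_LIBRARY))
print(match_floating(b"....Workbook....", SECOND_LIBRARY))
```

The only difference between the two matchers is whether the offset is part of the feature: a fixed feature is compared at one slice of the data, whereas a non-fixed feature requires a search over the whole file.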
On the basis of S101, this step aims to obtain the corresponding identification result by matching the position-fixed features and/or position-non-fixed features contained in the file to be classified against the first corresponding relation and/or the second corresponding relation.
Specifically, this step has three implementations:
first, identifying the file type of the file to be classified from position-fixed features only, using the first identification library that records the first corresponding relation;
second, identifying the file type of the file to be classified from position-non-fixed features only, using the second identification library that records the second corresponding relation;
third, identifying the file type of the file to be classified from both position-fixed features and position-non-fixed features, improving the accuracy of the conclusion by combining the two kinds of features.
It should be noted that, when the file type is identified from both kinds of features, the position-fixed features and the position-non-fixed features are in essence two separate kinds of features that do not affect each other. Depending on the actual situation, this case can therefore be subdivided into several sub-cases:
first, the identification operations based on the two kinds of features are carried out serially, and the operation for the latter kind of feature is performed only when the result of the operation for the former is not unique, so that unnecessary identification operations are avoided as far as possible;
second, the identification operations based on the two kinds of features are carried out serially, but both operations are always performed, and the final file type is determined by combining the two identification results;
third, the identification operations based on the two kinds of features are carried out simultaneously in parallel (which can be implemented with multithreading); because both run in parallel, the final file type determination necessarily combines the identification results of both kinds of features.
It should be further noted that the first two sub-cases do not restrict the execution order of the two identification operations in the serial mode; the order can be chosen flexibly according to the actual situation.
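The parallel sub-case could be sketched with a thread pool as below. The two recognizer functions are deliberately trivial placeholders, and all names are assumptions; the sketch only shows the structure in which both identification operations are submitted together and both results are collected.

```python
from concurrent.futures import ThreadPoolExecutor

def recognize_fixed(data: bytes) -> set:
    # Placeholder recognizer: one position-fixed feature at offset 0.
    return {"avi"} if data[:4] == b"RIFF" else set()

def recognize_floating(data: bytes) -> set:
    # Placeholder recognizer: one feature searched anywhere in the data.
    return {"avi"} if b"avih" in data else set()

def recognize_parallel(data: bytes) -> tuple:
    """Run both identification operations concurrently and return both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(recognize_fixed, data)
        second = pool.submit(recognize_floating, data)
        return first.result(), second.result()

print(recognize_parallel(b"RIFF\x00\x00\x00\x00AVI LISThdrlavih"))
```

In a standard CPython build, threads mostly illustrate the control flow here rather than deliver a speed-up for CPU-bound matching; a production implementation might use processes or a language with native parallel threads.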
S103: and determining the file type of the file to be classified according to the first recognition result and/or the second recognition result.
This step is intended to determine the file type of the file to be classified according to the specific manner selected in step S102 and the corresponding recognition result obtained.
According to the file type determining method provided by this embodiment, unlike the prior art, which determines the file type only from the superficial and easily modified suffix, the file type is identified from deeper within the file to be classified by means of position-fixed features and/or position-non-fixed features. Because these features are generated according to preset standards by files of the corresponding file types, they are extremely difficult to modify deliberately, so the accuracy of the file type determined from them is well guaranteed, and intrusion by malicious files that disguise their file type can be effectively prevented.
It should be noted that the foregoing embodiment introduces a basic solution for achieving the purpose of the present application. On this basis, the subsequent embodiments provide more specific implementations, or preferred improvements for particular application scenarios, of some steps of the first embodiment. Descriptions in the subsequent embodiments of steps that are the same as or correspond to those of the first embodiment may be cross-referenced, have the same beneficial effects, and are not repeated in detail.
On the basis of the previous embodiment, in order to determine the file type with relatively good accuracy while occupying as few computing resources as possible, this embodiment provides, for S102 and S103, a method that identifies the file type of the file to be classified serially, first by position-fixed features and then by position-non-fixed features, as shown in fig. 6. The method includes the following steps:
S201: executing file type identification operation on files to be classified by utilizing a first identification library to obtain a first identification result;
S202: judging whether the first identification result can uniquely determine the file type of the file to be classified, if so, executing S203, otherwise, executing S204;
S203: uniquely determining the file type of the file to be classified according to the first identification result;
This step follows from the determination in S202 that the file type of the file to be classified can be uniquely determined from the first identification result; that is, using the first corresponding relation recorded in the first identification library, position fixed features of only one file type were found in the file to be classified. In other words, the first identification result contains a single candidate file type, for example, the file type identification operation yields only JPG as the file type of the file to be classified.
S204: executing file type identification operation on the files to be classified by utilizing a second identification library to obtain a second identification result;
This step follows from the determination in S202 that, according to the first identification result, the file to be classified has multiple candidate file types; that is, the matched position fixed features correspond to multiple file types (this covers both the case where a single position fixed feature maps to several file types and the case where several position fixed features each map to a different file type).
Since the position fixed features yield multiple candidate file types, this embodiment performs the file type identification operation again using the position non-fixed features, so as to narrow the candidate range and improve the accuracy of the result by combining the two identification results.
Further, in S204 the file type identification operation is performed on the file to be classified afresh with the second identification library, rather than as a targeted reconfirmation of the multiple candidate file types determined in S201, so as to guard against file types that the position fixed feature identification in S201 may have missed. If such omissions can be ruled out, it is also possible to reconfirm only the candidate file types determined in S201, without checking every position non-fixed feature recorded in the second identification library.
S205: determining the file type of the file to be classified by using the first recognition result and the second recognition result.
According to the method, the file type identification operation based on the position fixed features is firstly executed on the files to be classified by utilizing the first identification library, if the obtained first identification result cannot uniquely determine the file type, the file type identification operation based on the position non-fixed features is executed on the files to be classified by utilizing the second identification library, and finally the two identification results are integrated to obtain the file type determination result with higher accuracy.
In the scheme provided by this embodiment, the second judgment is carried out only when the first judgment result is not unique. Overall, this avoids unnecessary computation, keeps the occupation of computing resources as low as possible, shortens the time consumed, and still gives a highly accurate determination result in most application scenarios.
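The serial flow of S201 to S205 can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name identify_serial, the Recognizer type and the rule used to combine the two results in S205 (prefer types found by both stages, otherwise keep all candidates) are hypothetical and are not specified by the application.

```python
from typing import Callable, List, Set

# A recognizer takes the raw file bytes and returns a set of candidate types.
Recognizer = Callable[[bytes], Set[str]]


def identify_serial(data: bytes, first: Recognizer, second: Recognizer) -> List[str]:
    """Run the second recognizer only when the first result is not unique."""
    first_result = first(data)                 # S201
    if len(first_result) == 1:                 # S202 -> S203
        return sorted(first_result)
    second_result = second(data)               # S204
    # S205 (assumed combination rule): prefer types reported by both
    # stages; if there is no overlap, keep every candidate.
    both = first_result & second_result
    return sorted(both) if both else sorted(first_result | second_result)
```

With this structure, the cost of the second stage is paid only for files whose position fixed features are ambiguous or absent.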
To facilitate understanding of the solution of the present embodiment, a flowchart of a more specific implementation method is provided below by fig. 7, which includes the following steps:
S301: respectively using each position fixed feature contained in the first recognition library to perform feature matching operation on the files to be classified to obtain matched position fixed features;
S302: determining the file type corresponding to the matched position fixed feature as a first matching file type according to the first corresponding relation;
Since the first corresponding relation recorded in the first recognition library is a correspondence between each file type and each position fixed feature, the essence of the file type recognition operation is a matching operation between the position fixed features recorded in the first recognition library and the actual position fixed features contained in the file to be classified.
S303: judging whether the number of the first matching file types is larger than 1, if so, executing S305, otherwise, executing S304;
corresponding to S202, this step is to determine whether the number of the first matching file types is greater than 1.
S304: determining the only first matching file type as the file type of the file to be classified;
S305: respectively using each position non-fixed feature contained in the second recognition library to perform feature matching operation on the file to be classified to obtain a matched position non-fixed feature;
S306: determining the file type corresponding to the matched position non-fixed feature as a second matching file type according to the second corresponding relation;
As explained for S301 and S302, the essence of the file type recognition operations in both parts is a matching operation between recorded features and actual features; the former uses position fixed features and the latter uses position non-fixed features.
S307: determining the file type of the file to be classified according to the first matching file type and the second matching file type.
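The two matching operations described in S301/S302 and S305/S306 can be illustrated with a small sketch. The library contents and all names below (FIRST_LIBRARY, SECOND_LIBRARY, match_fixed, match_non_fixed, types_for) are hypothetical; the byte signatures used are well-known ones (JPEG files begin with FF D8 FF, ZIP-based files such as xlsx begin with "PK\x03\x04", and an xlsx archive contains an entry named xl/workbook.xml), but the actual libraries of the application are not disclosed here.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical first library: (offset, signature) -> file types.
FIRST_LIBRARY: Dict[Tuple[int, bytes], Set[str]] = {
    (0, b"\xff\xd8\xff"): {"jpg"},
    (0, b"PK\x03\x04"): {"zip", "xlsx"},
}

# Hypothetical second library: signature found anywhere -> file types.
SECOND_LIBRARY: Dict[bytes, Set[str]] = {
    b"xl/workbook.xml": {"xlsx"},
}


def match_fixed(data: bytes) -> List[Tuple[int, bytes]]:
    """S301: keep features whose bytes appear at their fixed offset."""
    return [
        (offset, sig)
        for (offset, sig) in FIRST_LIBRARY
        if data[offset:offset + len(sig)] == sig
    ]


def match_non_fixed(data: bytes) -> List[bytes]:
    """S305: keep features that appear at any position in the file."""
    return [sig for sig in SECOND_LIBRARY if sig in data]


def types_for(matched, library) -> Set[str]:
    """S302/S306: map matched features to file types via the library."""
    result: Set[str] = set()
    for feature in matched:
        result |= library[feature]
    return result
```

For a file beginning with the ZIP signature, the first stage alone yields two candidates, which is exactly the situation in which S303 routes the file to the second stage.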
On the basis of the previous embodiment, for the case where there are multiple (more than 1) first matching file types, this embodiment provides, with reference to fig. 8, a scheme that evaluates, from the number and/or length of the features corresponding to each candidate file type, how likely each candidate is to be the real file type of the file to be classified, so as to give a more rigorous result, including the following steps:
S401: calculating a first probability according to the number and/or length of the matched position fixed features of each first matching file type, and sorting by probability to obtain a first sorting table;
S402: calculating a second probability according to the number and/or length of the matched position non-fixed features of each second matching file type, and sorting by probability to obtain a second sorting table;
The number refers to the number of matched position fixed/non-fixed features corresponding to each first/second matching file type; the larger the number, the higher the probability that the corresponding first/second matching file type is the real file type. The length refers to the data length of the matched position fixed/non-fixed feature corresponding to each first/second matching file type. The data length may be the length occupied by any representation of meaningful content and is usually the length of a character string; the longer the length, the higher the probability that the corresponding first/second matching file type is the real file type.
S403: accumulating the first probability and the second probability of the same file type in the first sorting table and the second sorting table to obtain a processed sorting table;
On the basis of S401 and S402, this embodiment integrates the probability rankings obtained from the two different kinds of features by accumulation, yielding a processed sorting table that combines both.
S405: and taking each matched file type sorted according to the possibility in the processed sorting table as a possible file type of the file to be classified.
The file type determination result finally output for a file to be classified may be "xlsx/xls/zip". This output shows that, by applying the scheme provided in the present application, the file to be classified is determined to possibly belong to three file types, xlsx, xls and zip, separated by "/". Assuming the three types are ordered from left to right by decreasing probability, the file is most likely xlsx, then xls, then zip. This result arises because xlsx, xls and zip share some position fixed and position non-fixed features: the underlying structures that form these files are similar, and since xlsx is the newer version of the xls file type, the two also have much in common.
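The accumulation of S403 and the "/"-separated output can be sketched as below. The probability values are invented purely for illustration, and the function name merge_rankings and the alphabetical tie-break are assumptions not taken from the application.

```python
from typing import Dict, List


def merge_rankings(first: Dict[str, float], second: Dict[str, float]) -> List[str]:
    """Add the probabilities of identical file types, then sort descending."""
    combined = dict(first)
    for file_type, prob in second.items():
        combined[file_type] = combined.get(file_type, 0.0) + prob
    # Ties are broken alphabetically so that the output is deterministic.
    return sorted(combined, key=lambda t: (-combined[t], t))


# Illustrative values only: the two sorting tables as type -> probability.
first_table = {"xlsx": 0.5, "xls": 0.3, "zip": 0.2}
second_table = {"xlsx": 0.6, "xls": 0.4}
print("/".join(merge_rankings(first_table, second_table)))
```

A file type that appears in only one table keeps its single probability, so it naturally ranks below types supported by both kinds of features.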
Further, in terms of how to perform probability ranking according to quantity and length features simultaneously, the application also specifically provides a method for realizing weighted scoring through a weighted calculation method and performing probability ranking according to size based on weighted scores, which comprises the following steps:
carrying out weighted calculation on each first matching file type according to the number and/or length of the corresponding fixed features of the matching positions by using a weighted calculation method to obtain each weighted score; wherein, the number corresponds to a preset first weight value, and the length corresponds to a preset second weight value;
and sequencing the weighted scores according to a preset sequencing mode to obtain a first sequencing table (the generation process of the second sequencing table is the same, and the generation process of the first sequencing table is taken as an example here).
Furthermore, when an application upgrade has created a newer version of a file type (for example, Excel versions before 2007 use xls, while 2007 and later use xlsx), the difference between the two versions is often reflected not in the number of features but in the length of the same feature. For example, both may have a position fixed feature at the same fixed position: if the feature for xls is "aaa bbb", the feature for xlsx is often "aaa bbb ccc"; that is, the newer version appends a part to the original feature string to distinguish itself. In this case the two cannot be distinguished by number, so the weighted calculation method appropriately increases the second weight corresponding to the length, and the difference in weights makes the resulting ranking more accurate.
On the basis of any of the above embodiments, even when identification is performed through the position fixed feature and/or the position non-fixed feature, it may happen that neither identification result contains any candidate file type. For this case, the present application provides a solution including but not limited to:
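A minimal sketch of such a weighted score follows, assuming a simple linear combination of the feature count and the total feature length. The linear form, the default weights and the names weighted_score and rank_by_score are assumptions; the application only states that the number and the length each correspond to a preset weight.

```python
from typing import Dict, List


def weighted_score(features: List[bytes], w_count: float = 1.0,
                   w_length: float = 0.5) -> float:
    """Score = w_count * number of matched features + w_length * total length."""
    total_length = sum(len(f) for f in features)
    return w_count * len(features) + w_length * total_length


def rank_by_score(matched: Dict[str, List[bytes]]) -> List[str]:
    """Sort file types by descending weighted score (ties alphabetical)."""
    scores = {t: weighted_score(fs) for t, fs in matched.items()}
    return sorted(scores, key=lambda t: (-scores[t], t))


# The example from the text: one feature each, differing only in length.
matched = {"xls": [b"aaa bbb"], "xlsx": [b"aaa bbb ccc"]}
print(rank_by_score(matched))
```

Here both types have one matched feature, so the count contributes equally and the ranking is decided entirely by the length term, which is the behaviour the paragraph above motivates.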
when the file type of the file to be classified cannot be determined according to the first recognition result and/or the second recognition result, performing text file feature recognition operation on the file to be classified to obtain a third recognition result;
and determining whether the file to be classified is a plain text file or not according to the third recognition result.
This solution stems from the observation that, in practical application scenarios, a plain text file such as one created with Notepad has a very simple data structure: it contains few features that could confirm its file type and holds only the recorded text. Decoding is therefore attempted with text encodings such as UTF-8, ASCII and UTF-8 without BOM to judge whether meaningful text content can be obtained correctly; if any one of them yields meaningful text content, the file to be classified can be regarded as a plain text file.
On the basis of any of the above embodiments, a file type may already have been determined from the file suffix when the file to be classified is received. To determine whether the file type indicated by the suffix is genuine, and in turn whether malicious content exists, the present application further provides a feasible scheme:
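The decoding attempt can be sketched as follows. What counts as "meaningful text content" is not defined by the application; the test used here (non-empty text in which every character is printable or common whitespace) is an assumed heuristic, and the function name looks_like_plain_text is hypothetical. Python's "utf-8-sig" codec is used to handle UTF-8 data that carries a BOM.

```python
from typing import Iterable


def looks_like_plain_text(
    data: bytes,
    encodings: Iterable[str] = ("ascii", "utf-8", "utf-8-sig"),
) -> bool:
    """Return True if any encoding decodes the data into readable text."""
    for encoding in encodings:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # This encoding cannot decode the data; try the next.
        # Assumed heuristic for "meaningful": non-empty, and every
        # character is printable or an ordinary whitespace character.
        if text and all(ch.isprintable() or ch in "\r\n\t" for ch in text):
            return True
    return False
```

A stricter check (for example, a minimum ratio of letters) could replace the heuristic without changing the overall try-each-encoding structure.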
marking the file type determined according to the first recognition result and/or the second recognition result as a first type;
judging whether the second type is consistent with the first type; the second type is a file type determined according to a suffix of the file to be classified;
if not, attaching an abnormal mark to the file to be classified, and executing malicious content detection operation on the file to be classified attached with the abnormal mark.
If the first type is inconsistent with the second type (when there are multiple first types, the two are considered consistent as long as the second type matches at least one of them), the file type given by the file suffix is wrong. The file to be classified is then considered to be attempting to bypass the file type detection mechanism by modifying its suffix and may therefore contain malicious content, so an abnormal mark is attached to it and a targeted malicious content detection operation is performed.
Through the linkage of this file type determination scheme and the malicious content detection mechanism, the whole system is better able to guard against malicious content that attempts to intrude by modifying the file type.
Because the situations are too varied to enumerate, those skilled in the art will recognize that many examples can be derived from the basic method principle provided by the application in combination with practical situations; such examples, obtained without inventive effort, shall fall within the protection scope of the application.
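The consistency check between the suffix-derived type and the identified types can be sketched as below. The function names and the returned dictionary are hypothetical, the suffix is compared literally (aliases such as jpg/jpeg are not handled), and the malicious content detection itself is outside the scope of the sketch.

```python
import os
from typing import Dict, Set


def suffix_type(filename: str) -> str:
    """Second type: the file type read from the file suffix."""
    return os.path.splitext(filename)[1].lower().lstrip(".")


def check_consistency(filename: str, first_types: Set[str]) -> Dict[str, object]:
    """Flag the file as abnormal if the suffix matches none of the first types."""
    second_type = suffix_type(filename)
    # With several first types, one match is enough to count as consistent.
    abnormal = second_type not in first_types
    return {"file": filename, "second_type": second_type, "abnormal": abnormal}
```

A caller would pass files with abnormal set to True on to whatever malicious content detection mechanism the system uses.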
Referring to fig. 9, fig. 9 is a block diagram of a file type determining apparatus according to an embodiment of the present application, where the apparatus may include:
a to-be-classified file receiving unit 100 for receiving an incoming to-be-classified file;
the file type identification unit 200 is configured to perform a file type identification operation on the file to be classified by using a preset first identification library and/or a preset second identification library, and correspondingly obtain a first identification result and/or a second identification result; the first identification library records a first corresponding relation between each file type and each position fixed feature, and the second identification library records a second corresponding relation between each file type and each position non-fixed feature;
a file type determining unit 300, configured to determine a file type of the file to be classified according to the first recognition result and/or the second recognition result.
The file type identifying unit 200 may include:
the first identification subunit is configured to execute the file type identification operation on the file to be classified by using the first identification library to obtain the first identification result;
the second identification subunit is configured to, when the file type of the file to be classified cannot be uniquely determined according to the first identification result, perform the file type identification operation on the file to be classified by using the second identification library to obtain a second identification result;
and the file type common identification subunit is used for determining the file type of the file to be classified by using the first identification result and the second identification result.
Wherein the first identification subunit may include:
the position fixed feature matching module is used for performing feature matching operation on the file to be classified by using each position fixed feature contained in the first identification library to obtain a matched position fixed feature;
the first matching file type determining module is used for determining the file type corresponding to the matching position fixed feature as a first matching file type according to the first corresponding relation;
correspondingly, the second identifying subunit may include:
the position non-fixed feature matching module is used for performing feature matching operation on the file to be classified by using each position non-fixed feature contained in the second identification library when the number of the first matching file types is greater than 1 to obtain a matching position non-fixed feature;
the second matching file type determining module is used for determining the file type corresponding to the non-fixed characteristic of the matching position as a second matching file type according to the second corresponding relation;
correspondingly, the file type common identification subunit may include:
and the file type common determination module is used for determining the file type of the file to be classified according to the first matching file type and the second matching file type.
Wherein the file type common determination module may include:
the first sequencing submodule is used for calculating to obtain a first probability according to the number and/or the length of the fixed features of the matching positions of each first matching file type, and sequencing according to the probability to obtain a first sequencing table;
the second sorting submodule is used for calculating to obtain a second probability according to the number and/or the length of the non-fixed features of the matching position of each second matching file type, and sorting according to the probability to obtain a second sorting table;
the accumulation processing submodule is used for accumulating the first probability and the second probability of the same file type in the first sorting table and the second sorting table to obtain a processed sorting table;
and the possible file type determining submodule is used for taking each matched file type in the processed sorting table as a possible file type of the file to be classified.
Wherein the first ordering sub-module may include:
the weighting calculation component is used for carrying out weighting calculation on each first matching file type according to the number and/or length of the corresponding fixed features at the matching positions by using a weighting calculation method to obtain each weighting score; wherein, the number corresponds to a preset first weight value, and the length corresponds to a preset second weight value;
and the score sorting component is used for sorting the weighted scores according to the sizes to obtain a first sorting table.
Further, the file type determining apparatus may further include:
the text file feature recognition unit is used for executing text file feature recognition operation on the file to be classified to obtain a third recognition result when the file type of the file to be classified cannot be determined according to the first recognition result and/or the second recognition result;
and the plain text file determining unit is used for determining whether the file to be classified is a plain text file according to the third identification result.
Further, the file type determining apparatus may further include:
the first type determining unit is used for marking the file type determined according to the first recognition result and/or the second recognition result as a first type;
the type consistency judging unit is used for judging whether a second type is consistent with the first type; the second type is a file type determined according to a suffix of the file to be classified;
and the abnormal mark attaching and malicious content detecting unit is used for attaching an abnormal mark to the file to be classified and executing malicious content detecting operation on the file to be classified attached with the abnormal mark when the second type is inconsistent with the first type.
The present embodiment exists as an apparatus embodiment corresponding to the above method embodiment, and has all the beneficial effects of the method embodiment, and details are not repeated here.
For a deeper understanding, the present application also shows how the file type determining apparatus provided herein operates in a specific application scenario; see the schematic diagram shown in fig. 10:
in the application scenario embodiment, the file type determining apparatus is used as an apparatus for determining the type of a file that a client wants to upload to a storage device for storage, and it can be seen that:
in fig. 10, two different types of uploading client 10 are shown on the left, respectively mobile and stationary, with the file type determining device 20 provided in the middle and the storage device 30 on the right.
In this scenario, the mobile uploading client wants to upload file 1 to the storage device 30 for storage, and the fixed uploading client wants to upload file 2 to the storage device 30 for storage. In both cases, the file type determining device 20 detects that file 1 and file 2 each carry a file type suffix identifying the file type.
The file type determining apparatus 20 checks, according to any determining method provided in the foregoing embodiments of the present application, whether file 1 and file 2 belong to file types that may be uploaded to the storage device 30 for storage. It finally determines that the file type determined for file 1 is consistent with its file type suffix and is a type that may be uploaded to the storage device 30 for storage, whereas the file type determined for file 2 is inconsistent with its file type suffix, so file 2 is judged to be an abnormal file and is not allowed to be uploaded to the storage device 30 for storage.
Based on the foregoing embodiments, the present application further provides an electronic device, where the electronic device may include a memory and a processor, where the memory stores a computer program, and when the processor calls the computer program in the memory, the electronic device may implement the steps of the file type determining method provided in the foregoing embodiments. Of course, the electronic device may also include various necessary network interfaces, power supplies, other components, and the like.
Fig. 11 is a schematic structural diagram of an electronic device 400, where the electronic device 400 includes a memory 410, a processor 420, and a bus 430, the memory 410 stores a file type determining program that can run on the processor 420, the file type determining program is transmitted to the processor 420 through the bus 430, and when being executed by the processor 420, the file type determining program can implement the steps in the file type determining method described in the above embodiment.
The memory 410 includes at least one type of readable storage medium, which includes flash memory, hard disk, multi-media card, card type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, and the like. The memory 410 may be an internal storage unit of the electronic device 400, such as a hard disk of the electronic device 400, in some embodiments. The memory 410 may also be an external storage device of the electronic device 400 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the electronic device 400. Further, the memory 410 may also be simultaneously composed of an internal storage unit and an external storage device. Further, the memory 410 may be used not only to store various application software and various types of data installed in the electronic device 400, but also to temporarily store data that has been output or will be output.
The processor 420, which in some embodiments may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data processing chip, runs program code or processes data stored in the memory 410, such as the file type determining program.
The bus 430 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one bi-directional hollow indicator line is shown in FIG. 11, but does not indicate only one bus or one type of bus.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method provided in the embodiments of the present application. The aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
It should be noted that the numbering of the embodiments of the present invention above is for description only and does not indicate the relative merits of the embodiments. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for determining a file type, comprising:
receiving an incoming file to be classified;
executing file type identification operation on the file to be classified by utilizing a preset first identification library and/or a preset second identification library to correspondingly obtain a first identification result and/or a second identification result; the first identification library records a first corresponding relation between each file type and each position fixed feature, and the second identification library records a second corresponding relation between each file type and each position non-fixed feature;
and determining the file type of the file to be classified according to the first recognition result and/or the second recognition result.
2. The method for determining the file type according to claim 1, wherein the step of performing a file type identification operation on the file to be classified by using a preset first identification library and/or a preset second identification library to obtain a first identification result and/or a second identification result correspondingly, and the step of determining the file type of the file to be classified according to the first identification result and/or the second identification result comprises:
executing the file type identification operation on the file to be classified by utilizing the first identification library to obtain a first identification result;
when the file type of the file to be classified cannot be determined uniquely according to the first identification result, executing the file type identification operation on the file to be classified by using the second identification library to obtain a second identification result;
and determining the file type of the file to be classified by using the first recognition result and the second recognition result.
3. The file type determination method according to claim 2, wherein performing the file type identification operation on the file to be classified by using the first identification library to obtain the first identification result comprises:
performing a feature matching operation on the file to be classified by using each position fixed feature contained in the first identification library to obtain matched position fixed features;
determining, according to the first corresponding relation, the file type corresponding to each matched position fixed feature as a first matching file type;
correspondingly, the file type of the file to be classified cannot be uniquely determined according to the first identification result when:
the number of first matching file types is greater than 1;
correspondingly, performing the file type identification operation on the file to be classified by using the second identification library to obtain the second identification result comprises:
performing a feature matching operation on the file to be classified by using each position non-fixed feature contained in the second identification library to obtain matched position non-fixed features;
determining, according to the second corresponding relation, the file type corresponding to each matched position non-fixed feature as a second matching file type;
correspondingly, determining the file type of the file to be classified by using the first identification result and the second identification result comprises:
determining the file type of the file to be classified according to the first matching file type and the second matching file type.
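The two-stage matching of claims 2 and 3 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the applicant's implementation: the library contents (`FIRST_LIBRARY`, `SECOND_LIBRARY`), the function names, and the rule of intersecting the two sets of matching file types are all hypothetical. The claims only require that the final type be determined from both sets of matches.

```python
# Hypothetical identification libraries; the entries are illustrative only.
# First library: file type -> (offset, signature) pairs, i.e. position fixed features.
FIRST_LIBRARY = {
    "zip": [(0, b"PK\x03\x04")],
    "docx": [(0, b"PK\x03\x04")],
    "png": [(0, b"\x89PNG\r\n\x1a\n")],
}
# Second library: file type -> byte patterns that may occur at any offset,
# i.e. position non-fixed features.
SECOND_LIBRARY = {
    "docx": [b"word/document.xml"],
    "xlsx": [b"xl/workbook.xml"],
}


def match_fixed(data):
    """Return the file types whose signature appears at its fixed offset."""
    return {
        ftype
        for ftype, features in FIRST_LIBRARY.items()
        if any(data[off:off + len(sig)] == sig for off, sig in features)
    }


def match_non_fixed(data):
    """Return the file types with a pattern found anywhere in the data."""
    return {
        ftype
        for ftype, patterns in SECOND_LIBRARY.items()
        if any(pattern in data for pattern in patterns)
    }


def identify(data):
    """Return the candidate file types as a sorted list."""
    first = match_fixed(data)
    if len(first) <= 1:
        # Zero or one first matching file type: the second library is not consulted.
        return sorted(first)
    # More than one first matching file type: consult the second library.
    second = match_non_fixed(data)
    # Assumed decision rule: keep types confirmed by both libraries,
    # falling back to the first-stage candidates if none is confirmed.
    return sorted(first & second) or sorted(first)
```

With these example libraries, a buffer that starts with the ZIP signature and contains `word/document.xml` is narrowed from `zip`/`docx` to `docx`, while a bare ZIP header stays ambiguous.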
4. The file type determination method according to claim 3, wherein determining the file type of the file to be classified according to the first matching file type and the second matching file type comprises:
calculating a first probability for each first matching file type according to the number and/or length of its matched position fixed features, and sorting the first matching file types by the first probability to obtain a first sorting table;
calculating a second probability for each second matching file type according to the number and/or length of its matched position non-fixed features, and sorting the second matching file types by the second probability to obtain a second sorting table;
accumulating the first probability and the second probability of each file type that appears in both the first sorting table and the second sorting table to obtain a processed sorting table;
and taking each matching file type in the processed sorting table as a possible file type of the file to be classified.
5. The file type determination method according to claim 4, wherein calculating the first probability according to the number and/or length of the matched position fixed features of each first matching file type, and sorting by the first probability to obtain the first sorting table, comprises:
calculating a weighted score for each first matching file type from the number and/or length of its matched position fixed features, wherein the number corresponds to a preset first weight value and the length corresponds to a preset second weight value;
and sorting the weighted scores by magnitude to obtain the first sorting table.
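The scoring and merging steps of claims 4 and 5 could look roughly like the sketch below. The weight values (`W_COUNT`, `W_LENGTH`), the normalization of weighted scores into probabilities, and the function names are assumptions made for this example; the claims do not fix any of them.

```python
# Assumed preset weights; the claims do not specify their values.
W_COUNT = 1.0    # first weight value: number of matched features
W_LENGTH = 0.5   # second weight value: total length of matched features, in bytes


def sorting_table(matches):
    """Build a sorting table from {file_type: [matched feature bytes, ...]}.

    Returns a list of (file_type, probability) pairs in descending order.
    Normalizing scores so they sum to 1 is an assumption of this sketch.
    """
    scores = {
        ftype: W_COUNT * len(feats) + W_LENGTH * sum(len(f) for f in feats)
        for ftype, feats in matches.items()
        if feats
    }
    total = sum(scores.values())
    table = [(ftype, score / total) for ftype, score in scores.items()]
    return sorted(table, key=lambda row: row[1], reverse=True)


def merge_tables(first_table, second_table):
    """Accumulate the values of identical file types across both tables."""
    merged = {}
    for ftype, prob in first_table + second_table:
        merged[ftype] = merged.get(ftype, 0.0) + prob
    return sorted(merged.items(), key=lambda row: row[1], reverse=True)


first = sorting_table({"zip": [b"PK\x03\x04"], "docx": [b"PK\x03\x04"]})
second = sorting_table({"docx": [b"word/document.xml"]})
processed = merge_tables(first, second)
```

Here `zip` and `docx` tie in the first table, and the second table breaks the tie in favour of `docx`. Note that the accumulated values are ranking scores and may exceed 1.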
6. The file type determination method according to any one of claims 1 to 5, wherein, when the file type of the file to be classified cannot be determined according to the first identification result and/or the second identification result, the method further comprises:
performing a text file feature identification operation on the file to be classified to obtain a third identification result;
and determining, according to the third identification result, whether the file to be classified is a plain text file.
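Claim 6 does not say which text file features are examined. As one possible illustration, the sketch below treats a file as plain text when a leading sample contains no NUL byte and decodes as UTF-8. Both the heuristic and the names (`looks_like_plain_text`, `sample_size`) are assumptions of this example rather than anything stated in the application.

```python
import codecs


def looks_like_plain_text(data, sample_size=4096):
    """Heuristic plain-text check on the first sample_size bytes (assumed rule)."""
    sample = data[:sample_size]
    if not sample:
        return False  # assumption: an empty file is left undetermined
    if b"\x00" in sample:
        return False  # NUL bytes are typical of binary formats
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # If the sample was cut short, tolerate a multi-byte character
        # that is split at the sample boundary (final=False).
        decoder.decode(sample, final=len(data) <= sample_size)
    except UnicodeDecodeError:
        return False
    return True
```

A real detector would likely also consider other encodings and the share of control characters; this sketch only shows where such a fallback check fits in the flow.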
7. The file type determination method according to claim 6, further comprising:
marking the file type determined according to the first identification result and/or the second identification result as a first type;
judging whether a second type is consistent with the first type, wherein the second type is the file type determined according to the suffix of the file to be classified;
if not, attaching an abnormal mark to the file to be classified, and performing a malicious content detection operation on the file to be classified to which the abnormal mark is attached.
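The consistency check of claim 7 can be illustrated with a short sketch. The suffix table `SUFFIX_TO_TYPE`, the record layout, and the `detector` callback are hypothetical; in particular, treating an unknown suffix as inconsistent is an assumption of this example.

```python
from pathlib import Path

# Hypothetical mapping from file name suffix to file type.
SUFFIX_TO_TYPE = {
    ".png": "png",
    ".jpg": "jpeg",
    ".jpeg": "jpeg",
    ".pdf": "pdf",
    ".exe": "pe",
}


def check_suffix(filename, first_type):
    """Compare the content-based type (first type) with the suffix-based type (second type)."""
    second_type = SUFFIX_TO_TYPE.get(Path(filename).suffix.lower())
    return {
        "filename": filename,
        "first_type": first_type,
        "second_type": second_type,
        # Assumption: an unknown suffix (second_type is None) also counts as abnormal.
        "abnormal": second_type != first_type,
    }


def scan_if_abnormal(record, detector):
    """Run the malicious content detector only on records carrying the abnormal mark."""
    if record["abnormal"]:
        return detector(record)
    return None
```

For instance, a file named `invoice.pdf` whose content is identified as a PE executable would be marked abnormal and passed to the detector, while `photo.JPG` identified as JPEG would not.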
8. A file type determination device, comprising:
a to-be-classified file receiving unit, configured to receive an incoming file to be classified;
a file type identification unit, configured to perform a file type identification operation on the file to be classified by using a preset first identification library and/or a preset second identification library to correspondingly obtain a first identification result and/or a second identification result, wherein the first identification library records a first corresponding relation between each file type and each position fixed feature, and the second identification library records a second corresponding relation between each file type and each position non-fixed feature;
and a file type determining unit, configured to determine the file type of the file to be classified according to the first identification result and/or the second identification result.
9. An electronic device, comprising:
a memory for storing a computer program;
and a processor for implementing the file type determination method according to any one of claims 1 to 7 when executing the computer program.
10. A readable storage medium, wherein a computer program is stored therein, and the computer program, when executed by a processor, implements the file type determination method according to any one of claims 1 to 7.
CN201911083245.4A 2019-11-07 2019-11-07 File type determination method and device, electronic equipment and readable storage medium Pending CN110825701A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911083245.4A CN110825701A (en) 2019-11-07 2019-11-07 File type determination method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911083245.4A CN110825701A (en) 2019-11-07 2019-11-07 File type determination method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN110825701A true CN110825701A (en) 2020-02-21

Family

ID=69553275

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911083245.4A Pending CN110825701A (en) 2019-11-07 2019-11-07 File type determination method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN110825701A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102571767A (en) * 2011-12-24 2012-07-11 成都市华为赛门铁克科技有限公司 File type recognition method and file type recognition device
CN108460155A (en) * 2018-03-28 2018-08-28 深信服科技股份有限公司 A kind of file identification method, device, equipment and storage medium
CN108768921A (en) * 2018-03-28 2018-11-06 中国科学院信息工程研究所 A kind of malicious web pages discovery method and system of feature based detection
CN110134644A (en) * 2019-05-17 2019-08-16 成都卫士通信息产业股份有限公司 File type identification method, device, electronic equipment and readable storage medium storing program for executing

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112738085A (en) * 2020-12-28 2021-04-30 深圳前海微众银行股份有限公司 File security verification method, device, equipment and storage medium
CN112738085B (en) * 2020-12-28 2023-08-08 深圳前海微众银行股份有限公司 File security verification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2020-02-21)