CN112685740A

CN112685740A - Compressed packet security detection method, device, terminal and storage medium

Info

Publication number: CN112685740A
Application number: CN201910990096.3A
Authority: CN
Inventors: 刘博�
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2019-10-17
Filing date: 2019-10-17
Publication date: 2021-04-20

Abstract

The invention provides a compressed packet security detection method, device, terminal and storage medium. A compressed packet is read and decompressed to obtain its files; each file is identified by an AI-based file type identification model, File-RNN, to determine the target file type to which it belongs; and a warning is output when the target file type of at least one file differs from that file's suffix name. In some embodiments, detecting files with AI-based file type identification exposes the security vulnerability of a compressed packet that may carry sensitive information files, and does so efficiently and accurately. A file whose identified type is inconsistent with its suffix is regarded as an illegal file, so the warning allows sensitive information files that may exist in the compressed packet to be filtered out efficiently and comprehensively, reducing the enterprise's security risk.

Description

Compressed packet security detection method, device, terminal and storage medium

Technical Field

The embodiment of the invention relates to the field of file detection, in particular to a method, a device, a terminal and a storage medium for compressed packet security detection.

Background

In a Windows (or Linux) environment, a file named test.doc can be renamed test.java and copied into the src directory of a Java Web project. As long as no code references test.java, the Java Web application still compiles normally and produces a releasable compressed packet, such as a WAR package. This compilation vulnerability means that sensitive information may be leaked. Detecting the file type with a file header detection tool does not guarantee security either, because the file header can be tampered with using a binary tool to disguise the file. This compilation vulnerability of compressed packets can be fatal to an enterprise, because its security is compromised and the resulting losses are unbounded.

Disclosure of Invention

Embodiments of the invention provide a compressed packet security detection method, device, terminal and storage medium, which mainly address the following technical problem: when a version is released in the form of a compressed packet, the packet may carry sensitive files to the cloud server, causing a serious information leakage security problem.

In order to solve at least the above technical problem, an embodiment of the present invention provides a compressed packet security detection method, including: reading a compressed packet and decompressing it to obtain each file; identifying each file based on the AI file type identification model File-RNN (which detects file types with a recurrent neural network, RNN) and determining the target file type of each file; and outputting a warning when the target file type of at least one file differs from the suffix name of that file.

The embodiment of the invention also provides a safety detection device, which comprises a decompression module, a File-RNN model identification module and a detection module; the decompression module is used for reading the compressed packets and decompressing the compressed packets to obtain each file; the File-RNN model identification module is used for identifying each File based on an AI File type identification model File-RNN and determining the target File type of each File; the detection module is used for outputting a warning when the type of the target file to which at least one file belongs is different from the suffix name of the file.

An embodiment of the present invention further provides a terminal, including: a processor, a memory, and a communication bus; the communication bus is used for realizing connection communication between the processor and the memory; the processor is configured to execute one or more computer programs stored in the memory to implement the steps of the compressed packet security detection method according to any one of the above.

Embodiments of the present invention further provide a storage medium, where one or more computer programs are stored, and the one or more computer programs may be executed by one or more processors to implement the steps of the compressed packet security detection method according to any one of the above descriptions.

The beneficial effects of the invention at least comprise:

The invention provides a compressed packet security detection method, device, terminal and storage medium. A compressed packet is read and decompressed to obtain its files; each file is identified by the AI-based file type identification model File-RNN to determine the target file type to which it belongs; and a warning is output when the target file type of at least one file differs from that file's suffix name. In some embodiments, detecting files with AI-based file type identification exposes the security vulnerability of a compressed packet that may carry sensitive information files, and does so efficiently and accurately. A file whose target file type is inconsistent with its suffix is regarded as an illegal file, so the warning allows sensitive information files that may exist in the compressed packet to be filtered out efficiently and comprehensively, reducing the enterprise's security risk.

Additional features and corresponding advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

Fig. 1 is a schematic flowchart of a compressed packet security detection method according to a first embodiment of the present invention;

Fig. 2 is a schematic flowchart of a File-RNN model training method according to a second embodiment of the present invention;

Fig. 3 is a flowchart of a file slicing module according to a second embodiment of the present invention;

Fig. 4 is a flowchart of a compressed packet security detection mechanism based on AI file type identification according to a third embodiment of the present invention;

Fig. 5 is a flowchart of a file feature extraction module according to a third embodiment of the present invention;

Fig. 6 is a schematic structural diagram of a security detection device according to a fourth embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a File-RNN model identification module according to a fourth embodiment of the present invention;

Fig. 8 is a schematic structural diagram of a terminal according to a fourth embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Embodiment one:

To address at least the serious information leakage problem that arises when an enterprise server releases a version as a compressed packet and the packet may carry sensitive files to a cloud server, this embodiment provides a compressed packet detection mechanism based on AI file type identification. The core of the mechanism is an AI file type identification model. Samples for the model are collected with a "slicing method", which gathers sample information efficiently and covers the text features of files widely, so that the model's loss function converges better to a locally optimal solution. This embodiment also provides a dedicated file feature structure, which makes the model more targeted when learning text features.

Based on the model, the compressed packet is decompressed before release and every decompressed file is checked with the model, which gives the type of each file and the probability value of that type. A warning is given when the result does not match the file's suffix name.

Through the mechanism, the problem that the compressed packet carries the sensitive information file can be effectively solved, and therefore the risk of leakage of the sensitive information of an enterprise is reduced. Referring to fig. 1, the present embodiment provides a method for detecting security of a compressed packet, which includes the following steps.

Step S101: read the compressed packet and decompress it to obtain each file.

The compressed packet includes, but is not limited to, a WAR package or a JAR package; in this embodiment the WAR package is used as the example. After the Java Web application has been compiled into a WAR package, the WAR package is read and decompressed into files with a decompression tool. At this point the files have not yet undergone the compressed packet security detection provided by this embodiment.
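
Step S101 can be sketched as follows. This is a minimal illustration, not the decompression tool of the embodiment: it relies on the fact that WAR and JAR packages are ZIP archives, and the function name `extract_package` and the path check are assumptions introduced here.

```python
import zipfile
from pathlib import Path


def extract_package(package_path, dest_dir):
    """Decompress a ZIP-based package (WAR/JAR) and return the extracted files."""
    dest = Path(dest_dir).resolve()
    with zipfile.ZipFile(package_path) as archive:
        for member in archive.namelist():
            # Assumed safety check: refuse entries that would land outside dest.
            target = (dest / member).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member}")
        archive.extractall(dest)
    # Every regular file found here is still "undetected" at this point.
    return sorted(p for p in dest.rglob("*") if p.is_file())
```

The returned list plays the role of the set of undetected files that the later steps iterate over.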

Step S102: identify each file based on the AI file type identification model File-RNN and determine the target file type of each file.

In this embodiment, AI file type identification is used to identify the undetected files. Specifically, the File-RNN model (File distinguish with RNN, a model that detects file types with a recurrent neural network) identifies the information in a file to judge its type. Because the amount of information in a file can be very large, the file content can be filtered and the key information extracted before the file is identified by the File-RNN model, which keeps file detection efficient. Specifically, each file is converted into a binary stream, invalid information in each file is filtered out, and the feature information of each file is extracted from the filtered binary stream according to a preset file feature structure. The file is converted into a binary stream because other representations, such as a byte stream, are less efficient for extracting text features and producing vectors. A file may contain meaningless invalid information, such as meaningless symbols, invalid words, garbled characters produced by encoding, and information content without timestamps; the stream codes corresponding to this invalid information are filtered out of the binary stream, which improves the efficiency of extracting the file's feature information.
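
The filtering step can be illustrated with a small sketch. The text does not define "invalid information" precisely, so the pattern below is purely an assumption: it treats ASCII control bytes (other than tab, line feed and carriage return) and the UTF-8 encoding of the replacement character U+FFFD as invalid. The names `INVALID_BYTES` and `filter_invalid` are hypothetical.

```python
import re

# Assumed definition of "invalid information" (not taken from the source):
# control bytes except \t, \n, \r, plus the UTF-8 bytes of U+FFFD.
INVALID_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]|\xef\xbf\xbd")


def filter_invalid(stream: bytes) -> bytes:
    """Remove byte sequences matched by INVALID_BYTES from a file's binary stream."""
    return INVALID_BYTES.sub(b"", stream)
```

A real deployment would likely need a different pattern for each file category, since bytes that are noise in a text file are ordinary payload in a picture or audio file.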

The extraction of a file's feature information depends on a file feature structure, which can be defined in advance. In this embodiment, as shown in Table 1 below, the file feature structure may include, but is not limited to, a file encoding and unit-length content encodings. The unit-length content encoding is divided into N segments, where N is a positive integer determined by the predefined type of the file. The file encoding is the encoding format of the file, which includes but is not limited to UTF-8 and GBK; files of the same type have different binary content when their encoding formats differ. A unit-length content encoding is a piece of the file's binary stream extracted within a predefined length, and N is the number of extracted segments; that is, N segments of unit-length content encoding serve as the feature information of the file under the file feature structure. In some embodiments, the file feature structure consists of the file encoding, the file magic number, and the unit-length content encodings. In one embodiment, the file magic number is the first 4 bytes of the file's binary form; the magic numbers of different file types usually differ, and this field serves only as a reference value during model training and does not participate in the training itself. For example, as shown in Table 1, if a java file is specified to yield binary segments within a length of 200 bytes and 50 unit-length segments are taken, the file feature structure for java and other text files has 52 fields in total (the file encoding, the file magic number, and the 50 segments).

Similarly, when the suffix of a decompressed file is jpg, 100 segments can be taken as the feature information according to Table 1, because picture files such as jpg contain a large amount of data; the file feature structure for the picture type then has 102 fields. For audio and video files, 200 segments can be taken, giving 202 fields. Different file types can define different structures, that is, N can be set to a different value for each file type, which makes the file feature structure extensible.

TABLE 1

N = 50 (the file type is a text file such as java or txt);

N = 100 (the file type is a picture file such as bmp or png);

N = 200 (the file type is an audio or video file such as mkv or MP3).
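
The file feature structure can be sketched as a small data type. This is an illustration under stated assumptions rather than the structure used in the embodiment: the names `FileFeature`, `SEGMENTS_BY_CATEGORY` and `extract_features` are hypothetical, the segments are assumed to be taken consecutively after the magic number, short segments are assumed to be zero-padded, and the 200-byte unit length is borrowed from the java example in the text.

```python
from dataclasses import dataclass

# Segment counts N per file category, as listed in Table 1.
SEGMENTS_BY_CATEGORY = {"text": 50, "picture": 100, "audio_video": 200}


@dataclass
class FileFeature:
    encoding: str        # file encoding, e.g. "UTF-8" or "GBK"
    magic_number: bytes  # first 4 bytes; reference value only, not trained on
    segments: list       # N unit-length content encodings

    def field_count(self) -> int:
        # file encoding + magic number + N segments
        return 2 + len(self.segments)


def extract_features(data: bytes, category: str, encoding: str = "UTF-8",
                     unit_length: int = 200) -> FileFeature:
    n = SEGMENTS_BY_CATEGORY[category]
    body = data[4:]  # assumption: content segments start after the magic number
    segments = []
    for i in range(n):
        chunk = body[i * unit_length:(i + 1) * unit_length]
        # Assumption: zero-pad so that every file yields exactly N full segments.
        segments.append(chunk.ljust(unit_length, b"\x00"))
    return FileFeature(encoding, data[:4], segments)
```

With these settings a text file yields 52 fields, a picture file 102 and an audio or video file 202, matching the counts given in the description.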

After the feature information of the file is extracted, it is packaged into a data structure object. This object must then be converted into a vector type that the File-RNN model can process; the conversion can be performed with word2vec or similar models.

In this embodiment, the File-RNN model performs inference on the vector object, determines the file types to which the corresponding file may belong together with the probability value of each type, and takes the file type with the maximum probability value as the target file type of the file. Specifically, on receiving the feature information in vector form, the File-RNN model uses the parameters obtained from prior training and a forward propagation algorithm to calculate which file type feature set the feature information is a subset of. If the feature information belongs to a subset of a certain type, a probability value is calculated from the coverage rate of the subset; this value represents the probability that the file belongs to that type. Similarly, if the feature information belongs to subsets of several types, several probability values are given. The larger the probability value, the more likely it is that the file belongs to that type, so the file type with the maximum probability value is taken as the target file type of the file. The file type feature sets include, but are not limited to, the text type, the picture type, and the audio/video type. For example, vector object 1 is obtained according to Table 1 for a file with the txt suffix and is fed into the File-RNN model. If the model gives a probability of 76% that vector object 1 belongs to the text type and 30% that it belongs to the picture type, the text type is taken as the target file type of the txt file.
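
The selection rule at the end of this step can be written in a few lines. The sketch below assumes the model's output is available as a mapping from type name to probability; the function name `target_file_type` is hypothetical.

```python
def target_file_type(probabilities):
    """Return the (file type, probability) pair with the largest probability."""
    if not probabilities:
        raise ValueError("the model returned no candidate types")
    best = max(probabilities, key=probabilities.get)
    return best, probabilities[best]
```

For the example in the text, `target_file_type({"text": 0.76, "picture": 0.30})` selects the text type.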

Note that the File-RNN model in this embodiment is a finished model obtained through repeated training. Before the WAR package is read and decompressed to obtain each file, the model is trained on file samples of different types of known files using a recurrent neural network (RNN) to obtain the File-RNN model. Here, known files of different types are files whose specific file type is known; they cover various file types and may include, but are not limited to, the three file types in Table 1.

In one embodiment, obtaining the File-RNN model with an RNN specifically includes: collecting known files of different types, which may be collected by a terminal or by a user; randomly slicing the known files of each type to obtain file samples; extracting the feature information of the file samples according to the preset file feature structure to obtain vector objects; and then, following AI model training, executing the RNN (recurrent neural network) inference model on the vector objects to determine the File-RNN model. The vector objects are obtained from the file feature structure of the file samples as described above, which is not repeated here. Training is performed in a loop following the conventional AI model training steps: inputting the vector object of a training file sample, executing the RNN inference model on the vector object, calculating the loss, adjusting the model parameters, and so on. Training stops when the value of the loss function reaches the optimal solution, yielding a finished File-RNN model capable of identifying various file types.

To improve the efficiency of model training, this embodiment supplies file samples of known files to the File-RNN model by a slicing method. Randomly slicing a known file means cutting the whole file at random positions such that the cut content is continuous; that is, a starting point in the file is selected at random and the file is cut from there as required. For example, a starting point a is selected at random and the cut ends at b, giving cut file 1, "a-b"; the next cut starts at b and ends at c, giving cut file 2, "b-c". Cut file 1 and cut file 2 are therefore continuous. In AI model training, the samples are as important as the model design: sample quality determines whether the model is usable. The slicing method fragments large files, which improves the efficiency of sample preparation and model training, and because the cuts are random, important feature information in the samples is not omitted. In this embodiment, because different file types correspond to different file feature structures, the slice size and the number of slices can be set separately for each type of known file before slicing; the known files of each type are then randomly sliced according to these settings, and the resulting slices are used as file samples. For example, given 1000 different java files, the module might set the java slice size to 10k with 10 slices per file. This yields 10000 slices of size 10k each, which are supplied to the File-RNN model as samples for training on the java file type. This is more efficient than using the 1000 files directly as samples: if the 1000 files contain many repeated segments, or if each file is very large, training time and hardware resources are wasted and the training result does not necessarily reach the optimal solution. The slice size and number for each type can of course be set flexibly according to actual requirements; for example, both can be set larger for audio and video files.

In this embodiment, randomly cutting the known files of each type according to the slice size and number includes the following. When the size of a known file is larger than a preset threshold, the file is cut at random: the file header is filtered out first, that is, the file magic number in the file feature structure (the first 4 bytes of the binary stream) is removed; a starting point in the filtered file is then selected at random and a slice of the set size is cut, and this is repeated until the set number of slices is reached. The preset threshold may be set by the user. For example, it may equal the slice size, so that a known file is cut whenever it is larger than the slice size; the threshold may also be larger than the slice size. When the size of a known file is smaller than the preset threshold, no slicing operation is needed and the whole known file is used as a file sample.

Step S103: output a warning when the target file type of at least one file differs from the suffix name of that file.

When the target file type of a file differs from the file's suffix name, the file may have been modified. The file is treated as an "illegal" file, marked with a warning, and placed in a warning list. When the warning list contains at least one illegal file, a warning is output as a prompt, for example by sound or flashing light. In some embodiments, when the target file type of a file differs from its suffix name, release of the WAR package is prohibited at the same time as the warning is output. The manner of outputting the warning is not limited to the above; any other manner capable of notifying or reminding the user or the device is applicable.

This embodiment provides a compressed packet security detection method built on an AI-based file type identification model. Samples are collected for various file types, features are extracted during training according to the defined file feature structure, and after repeated training the model can identify the type of a file and give the probability of that type. Based on this model, every file in the decompressed WAR package is type-checked and the type and probability of each file are output. If the type of one or more files is found to be inconsistent with the file suffix, those files are regarded as illegal files, and the mechanism issues a warning and prevents the compressed packet from being released. Detecting files with AI is efficient and accurate, and using AI technology to improve the security of system software products is innovative in the relevant field. AI technology can handle this complex problem and prevent the leakage of sensitive information; the mechanism can efficiently and comprehensively filter out sensitive information files that may exist in a compressed packet and reduce the enterprise's security risk.

Embodiment two:

For ease of understanding, this embodiment provides a File-RNN model training method. Files of all the file types to be identified are collected first, and the output of the file slicing module for these files is used as the samples for the File-RNN model. Training then proceeds in a loop following the conventional AI model training steps: inputting training samples, executing the inference model on them, calculating the loss, adjusting the model parameters, and so on. Training stops when the value of the loss function reaches the optimal solution, at which point the model is able to identify various file types. As shown in Fig. 2, the File-RNN model training method comprises the following steps:

Step S201: Classify the files (assume n types) and group the files of each type into a set, denoted S_i (i = 1, ..., n). Add a training flag to each set S_i: a flag of 1 means training is complete, and a flag of 0 means untrained. Step S202 is performed.

Step S202: Check the training flags of the sets S_i (i = 1, ..., n). If a set S_i has a flag value of 0, go to step S203; if every S_i has a flag value of 1, go to step S208.

Step S203: Take the files of set S_i and input them into the file slicing module. The number of samples output by the slicing module is denoted q, and each sample is denoted L_k (k = 1, ..., q). Add a training flag to each L_k: a flag of 1 means training is complete, and a flag of 0 means untrained. Step S204 is performed.

Step S204: Check the training flag of L_k. If the flag value is 0, go to step S205; if the flag value is 1, go to step S202.

Step S205: Input L_k, as text, into the text feature extraction module, which outputs a vector object (denoted vec-obj). Step S206 is performed.

Step S206: Initialize the model parameters, input vec-obj, execute the RNN inference model on vec-obj, and calculate the value of the loss function. Step S207 is performed.

Step S207: Update the model parameters by gradient descent or a similar method so that the loss is minimized. When the loss value reaches the optimal solution, stop training the model on L_k and set the training flag of L_k to 1.

Step S208: Stop model training. The trained model is called the File-RNN model.
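
The control flow of steps S201 to S208 can be sketched without committing to any particular RNN library. In the illustration below the slicing, feature extraction and parameter update are passed in as callables, and all names (`train_file_rnn`, `slice_files`, `to_vector`, `train_step`) are hypothetical. The point at which a set's flag is set to 1 is not stated in the steps; the sketch assumes it happens once all of the set's samples are trained.

```python
def train_file_rnn(files_by_type, slice_files, to_vector, train_step):
    """Visit every type set and every sample once, mirroring S201-S208.

    Returns the number of samples the model was trained on.
    """
    set_done = {file_type: False for file_type in files_by_type}  # S201
    trained = 0
    for file_type, files in files_by_type.items():                # S202
        samples = slice_files(file_type, files)                   # S203
        sample_done = [False] * len(samples)
        for k, sample in enumerate(samples):                      # S204
            vec_obj = to_vector(sample)                           # S205
            train_step(vec_obj, file_type)                        # S206-S207
            sample_done[k] = True
            trained += 1
        # Assumption: the set is flagged once all of its samples are trained.
        set_done[file_type] = all(sample_done)
    return trained                                                # S208
```

Here `train_step` stands in for one pass of inference, loss calculation and parameter update on a single vector object; the convergence test of step S207 would live inside it.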

Fig. 3 is a flowchart of the file slicing module, an important module that supplies samples to the File-RNN model and is at the core of this embodiment. For each type of file collected, the module sets the slice size and the number of slices in advance. The file slicing module flow comprises the following steps:

Step S301: Set the slice size S and the number of slices C separately for each type of file. The process advances to step S302.

Step S302: For each input file of a given type, if the size of the file is larger than S, go to step S303; otherwise, go to step S304.

Step S303: Cut each input file according to the set S and C. The cutting principle is as follows: first filter out the file header so that the file's magic number does not enter the samples; then randomly select a starting point in the file and intercept a segment from that point whose size equals S. The intercepted segment is a slice. Repeat the cutting step until the number of slices reaches C. The process advances to step S305.

Step S304: the input file is regarded as a slice, and operations such as cutting and the like are not needed. The process advances to step S305.

Step S305: each slice is taken as a model training sample for training the type file.
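
Steps S301 to S305 can be sketched as a single function. This is a minimal illustration that assumes the file is already in memory as bytes, that the header to filter is the 4-byte magic number, and that each slice uses an independently chosen random starting point; the name `slice_file` and the optional `rng` parameter are hypothetical.

```python
import random


def slice_file(data, slice_size, slice_count, rng=None):
    """Return the training samples for one file (steps S302-S305)."""
    rng = rng or random.Random()
    if len(data) <= slice_size:
        # S304: a small file is used whole, with no cutting.
        return [data]
    body = data[4:]  # S303: filter out the 4-byte magic number
    if len(body) <= slice_size:
        # Assumed edge case: nothing left to choose from after the header.
        return [body]
    last_start = len(body) - slice_size
    slices = []
    for _ in range(slice_count):
        start = rng.randint(0, last_start)  # random starting point
        slices.append(body[start:start + slice_size])
    return slices
```

Passing a seeded `random.Random` makes the sampling reproducible, which can help when comparing training runs. Slices may overlap under this scheme; a variant that cuts consecutive segments, as in the "a-b", "b-c" example of the first embodiment, would only need a different rule for choosing `start`.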

Embodiment three:

Based on the model of the second embodiment, a compressed packet detection system is provided. Before a WAR package is released, the system decompresses it and checks every decompressed file with the model described above, which gives the type of each file and the corresponding probability value. When the result does not match the file's suffix name, a warning is given and release of the WAR package is blocked.

As shown in Fig. 4, the compressed packet security detection method includes:

Step S401: After the Java Web program has been compiled into a WAR package, the package must first be checked. The WAR package file is read and a decompression tool is called to decompress it into a folder. All decompressed files are placed in a list, and each file is marked as undetected. Step S402 is performed.

Step S402: Traverse the file list to determine whether any undetected file remains. If so, go to step S403; otherwise, step S409 is performed.

The decompressed folder is traversed, files of the same type are grouped together and stored in a list, called the type list, and the type list is traversed to check whether all of its items have been detected.

Step S403: Define the field values of the file feature structure according to the predefined type of the file. Step S404 is performed.

The file feature structure may be as shown in Table 1.

Step S404: Input the file to be detected and the field values of its feature structure into the file feature extraction module as input parameters; the module outputs a vector object. Step S405 is performed.

The purpose of the file feature extraction module is as follows. Because the amount of information in a file is very large, the key information must be extracted and converted into an object type that the AI model can process directly, which keeps file detection efficient. First, the file is converted into a binary stream and read into memory in segments. Second, the module filters out invalid information in the file, such as garbled characters and meaningless symbols. Finally, it extracts the feature information from the file stream and converts it into a vector type that the AI model can process.

Step S405: After receiving the output of step S404, the File-RNN model performs inference and outputs all the file types to which the file may belong, together with the probability value of each type. Step S406 is performed.

After receiving the feature information provided by the text feature extraction module, the RNN-based file type identification model (File-RNN model for short) uses the parameters obtained from prior training and a forward propagation algorithm to determine which file type feature set the feature information is a subset of. If the feature information belongs to a subset of a certain type, a probability value is calculated from the coverage rate of the subset; this value represents the probability that the file belongs to that type. Similarly, if the feature information belongs to subsets of several types, several probability values are given.

Step S406: the file type corresponding to the maximum probability value is selected as the final result, and step S407 is executed.

The type results output by the model are compared by their probability values; in theory, the higher the probability value, the more likely it is that the file belongs to that type.

Step S407: determine whether the output file type is consistent with the file suffix name. If not, step S408 is executed; otherwise, step S402 is executed.

Step S408: the file is marked with a warning and its file name is added to the warning list.

Step S409: check whether the warning list is empty. If it is empty, go to step S411; otherwise, step S410 is performed.

Step S410: forbid the WAR package from being released, and list all the warning files.

Step S411: allow the WAR package to be released, and list the detection result.
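The decision flow of steps S405 to S411 can be sketched as follows. This is only an illustrative outline: `fake_classify` is a hypothetical stand-in for the File-RNN model, and the sketch assumes that the model's type labels can be compared directly with lower-case file suffixes, which the source does not specify.

```python
import os


def pick_type(probabilities):
    """S406: choose the file type with the highest probability."""
    return max(probabilities, key=probabilities.get)


def check_package(files, classify):
    """S405-S411: return (release_allowed, warning_list).

    `files` is an iterable of file names; `classify` stands in for the
    File-RNN model and maps a file name to {type: probability}.
    """
    warnings = []
    for name in files:
        predicted = pick_type(classify(name))                    # S405, S406
        suffix = os.path.splitext(name)[1].lstrip(".").lower()
        if predicted != suffix:                                   # S407
            warnings.append(name)                                 # S408
    return len(warnings) == 0, warnings                           # S409-S411


def fake_classify(name):
    # Hypothetical model output: pretends test.java is really a Word document.
    if name == "src/test.java":
        return {"doc": 0.91, "java": 0.09}
    suffix = os.path.splitext(name)[1].lstrip(".")
    return {suffix: 0.97, "other": 0.03}


allowed, warning_list = check_package(
    ["src/Main.java", "src/test.java", "WEB-INF/web.xml"], fake_classify
)
print(allowed, warning_list)  # False ['src/test.java']
```

Because the warning list is not empty, the release is forbidden (step S410) and the offending file is listed.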

Fig. 5 is a flowchart of the file feature extraction module. The extraction of file feature information depends on a file feature structure, which constrains the information that the file features must contain. The file feature structure is defined in table 1; its attributes are the file encoding, the file magic number, and the unit length content codes. The file encoding is the encoding format of the file. Files of the same type yield different binary streams under different encodings and therefore require different identification methods, so the file encoding is a necessary attribute. The file magic number is the first 4 bytes of the file in binary form. Files of different types usually have different magic numbers; this field serves as a reference value during model training and does not participate in the training itself. The unit length content codes are segments of the file binary stream, each extracted with a defined length; n is the number of extracted segments, and the content of these fields is used as the training parameters of the sample. The file feature extraction module performs the following steps:

Step S501: read the file and unify its encoding format, for example to UTF-8 or GBK. Step S502 is performed.

Step S502: convert the file to a binary stream, because other representations, such as byte streams, are less efficient than binary streams for extracting text features and producing vectors. Step S503 is performed.

Step S503: filter invalid information in the file, including meaningless symbols, invalid words, characters garbled by the encoding unification, and content without timestamps. Filtering can use regular expressions or similar methods. Step S504 is performed.

Step S504: extract feature information from the binary stream according to the text feature structure defined in table 1. The extracted information is stored as data structure objects, which are called the feature information of the file. Step S505 is executed.

Step S505: convert the file feature information into vector objects (i.e., mathematical symbols) using word2vec, FastText, or other model tools. After this step, the module has completed the extraction of file features and outputs vector objects that the AI model can process directly.
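Steps S502 to S505 can be sketched roughly as below, under several assumptions that are not in the source: the filter only drops control bytes, the magic number is excluded from the content segments, and a simple byte scaling replaces word2vec or FastText as the vectorisation step. The names `FileFeature`, `filter_invalid`, `extract_features` and `to_vectors` are hypothetical, and step S501 (encoding unification) is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileFeature:
    encoding: str            # file encoding, e.g. "utf-8"
    magic_number: bytes      # first 4 bytes; reference value only
    segments: List[bytes]    # up to n unit-length content segments


def filter_invalid(data: bytes) -> bytes:
    """S503 stand-in: drop control bytes except tab, newline and carriage return."""
    keep = {9, 10, 13}
    return bytes(b for b in data if b >= 32 or b in keep)


def extract_features(data: bytes, unit_length: int, n: int,
                     encoding: str = "utf-8") -> FileFeature:
    """S502-S504: build the feature structure from the raw bytes of one file."""
    magic = data[:4]
    cleaned = filter_invalid(data[4:])   # assumption: segments exclude the magic number
    segments = [cleaned[i * unit_length:(i + 1) * unit_length] for i in range(n)]
    return FileFeature(encoding, magic, [s for s in segments if s])


def to_vectors(feature: FileFeature, unit_length: int) -> List[List[float]]:
    """S505 stand-in: scale bytes to [0, 1] and zero-pad each segment."""
    vectors = []
    for segment in feature.segments:
        values = [b / 255 for b in segment]
        vectors.append(values + [0.0] * (unit_length - len(values)))
    return vectors


sample = b"\xca\xfe\xba\xbe\x00\x00public class Test {}\n"
feature = extract_features(sample, unit_length=8, n=3)
vectors = to_vectors(feature, unit_length=8)
```

Here `vectors` holds one fixed-length vector per extracted segment, which is the kind of object a sequence model can take as input one time step at a time.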

According to the embodiment of the invention, an AI-based RNN model is trained on files of different types so that the model can judge the type of a file from its feature structure. Unlike traditional techniques, AI can address constantly changing problems through continuous training and optimization, and it is more flexible and easier to extend. Problems that traditional techniques cannot solve can often be solved well with AI.

Furthermore, the embodiment of the invention is mainly used to inspect released version files and to prevent the information leakage that occurs when a version file carries sensitive information. Many file detection solutions exist in the industry, but most of them identify files by their file headers. By contrast, the AI file type recognition technology combines a recurrent neural network (RNN) with the file feature structure to calculate the probability of the type to which a file belongs, and does not depend on the file header. The embodiment of the invention can detect whether a WAR package contains sensitive information files and thus prevent an enterprise's sensitive information from being leaked when a version is released.

Example four:

An embodiment of the present invention provides a security detection apparatus, as shown in fig. 6, including a decompression module 601, a File-RNN model identification module 602, and a detection module 603.

The decompression module 601 is configured to read the compressed packet and decompress it to obtain each file;

the File-RNN model identification module 602 is configured to identify each file based on a preset AI file type identification model File-RNN, and determine the target file type to which each file belongs;

the detection module 603 is configured to output a warning when the target file type to which at least one file belongs is different from the suffix name of the file.
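The role of the decompression module can be illustrated with Python's standard `zipfile` module, since a WAR package is a ZIP-format archive. This is a minimal sketch rather than the device's actual implementation; the function name `read_package` and the in-memory demo archive are hypothetical.

```python
import io
import zipfile


def read_package(package_bytes):
    """Return {file name: content bytes} for every regular file in a ZIP-format package."""
    files = {}
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories carry no content to classify
            files[info.filename] = archive.read(info)
    return files


# Build a tiny in-memory archive standing in for a WAR package.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("WEB-INF/web.xml", "<web-app/>")
    archive.writestr("src/test.java", "not really Java source")

files = read_package(buffer.getvalue())
```

Each entry in `files` can then be handed to the identification module, and the predicted type compared with the suffix of the entry name by the detection module.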

As shown in fig. 7, the File-RNN model identification module 602 includes: a file feature extraction module 6031, an important file processing module in the scheme, used to filter file contents and extract the text features of files; a file slicing module 6032, a core module of the scheme, mainly used to provide training samples for the File-RNN model; and a File-RNN model module 6033, which can detect the type of any file and give a probability value.

The file feature extraction module 6031 is configured to: convert each file into a binary stream and filter invalid information in each file; extract feature information of each file from the filtered binary stream according to a preset text feature structure, where the preset text feature structure comprises a file encoding, a file magic number and a unit length content code; divide the unit length content code into N sections, where N is a positive integer determined according to the file definition type; take the N sections of unit length content codes as the feature information of the file; and convert the feature information into a vector object.

The file slicing module 6032 is configured to: collect known files of different types, and set the slice size and slice count for each type of known file; randomly cut the known files of each type according to the slice size and count, and take the cut pieces as file samples; when the size of a known file is larger than a preset threshold, filter the file header of the known file; and randomly select a starting point in the filtered known file and cut according to the slice size until the set number of slices is reached.
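A minimal sketch of the random slicing described for module 6032 follows. The header size of 4 bytes, the threshold of 1024 bytes and the handling of files smaller than one slice are illustrative assumptions that the source does not specify, and the name `slice_known_file` is hypothetical.

```python
import random


def slice_known_file(data, slice_size, slice_count,
                     header_size=4, threshold=1024, rng=None):
    """Cut `slice_count` random slices of `slice_size` bytes from one known file.

    `header_size` and `threshold` are illustrative values, not taken from the source.
    """
    rng = rng or random.Random()
    if len(data) > threshold:
        data = data[header_size:]          # filter the file header of large files
    if len(data) <= slice_size:
        return [data] if data else []      # assumption: small files are kept whole
    slices = []
    while len(slices) < slice_count:
        start = rng.randrange(len(data) - slice_size + 1)   # random starting point
        slices.append(data[start:start + slice_size])
    return slices


known_file = bytes(range(256)) * 8          # 2048 bytes of stand-in content
samples = slice_known_file(known_file, slice_size=64, slice_count=5,
                           rng=random.Random(42))
```

Dropping the header before slicing means the samples do not all begin with the same signature bytes, which fits the stated goal of recognising a file without depending on its file header.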

The File-RNN model module 6033 is configured to extract feature information from the file samples according to a preset file feature structure to obtain vector objects, and to run the RNN on the vector objects during AI model training to determine the File-RNN model.

The invention applies AI file type identification technology to the field of system software security and efficiency, and can detect the security vulnerability whereby a compressed packet carries files that may contain sensitive information. Detecting files with AI is efficient and accurate, and improving the security of system software products with AI technology is innovative in the relevant field, because AI technology can not only solve complex problems but also prevent the leakage of sensitive information.

The technical scheme provided by the invention does not require investment in a large amount of hardware equipment, has a low software cost, can be deployed widely (to terminals, servers, embedded systems and the like), is simple to maintain (maintenance mainly involves retraining the model), and can effectively reduce the risk of leaking an enterprise's security information.

The present embodiment further provides a terminal, as shown in fig. 8, which includes a processor 801, a memory 803, and a communication bus 802, where:

the communication bus 802 is used for realizing connection communication between the processor 801 and the memory 803;

the processor 801 is configured to execute one or more computer programs stored in the memory 803 to implement at least one step of the compressed packet security detection method in the above embodiments.

Notably, the AI file type recognition model is a relatively complex algorithm based on an RNN (recurrent neural network). Because the time and space complexity of the algorithm is high, the AI model places high demands on the response speed of the software when training on samples: training is a process of continuous optimization and gradual convergence toward an optimal solution, and when the number of samples is large, CPU and RAM consumption is considerable.

Secondly, if the system is used on the server side (not limited to the local terminal), the system needs to have access to the internet.

In summary, the CPU and RAM of the system should have strong processing capability, and the system should be able to run the Internet Protocol (IP); there are no other requirements for the physical hardware. As for the software environment, the implementation of the present invention does not depend on a specific operating system (it is portable), so there is no requirement for the operating system on which the system product software runs.

The present embodiments also provide a computer-readable storage medium including volatile or non-volatile, removable or non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, computer program modules or other data. Computer-readable storage media include, but are not limited to, RAM (Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other Memory technology, CD-ROM (Compact disk Read-Only Memory), Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.

The computer readable storage medium in this embodiment may be used to store one or more computer programs, and the stored one or more computer programs may be executed by a processor to implement at least one step of the compressed packet security detection method in the above embodiments.

The present embodiment also provides a computer program (or computer software), which can be distributed on a computer readable medium and executed by a computing device to implement at least one step of the compressed packet security detection method in the foregoing embodiments; and in some cases at least one of the steps shown or described may be performed in an order different than that described in the embodiments above.

The present embodiments also provide a computer program product comprising a computer readable means on which a computer program as shown above is stored. The computer readable means in this embodiment may include a computer readable storage medium as shown above.

It will be apparent to those skilled in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software (which may be implemented in computer program code executable by a computing device), firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit.

In addition, communication media typically embodies computer readable instructions, data structures, computer program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to one of ordinary skill in the art. Thus, the present invention is not limited to any specific combination of hardware and software.

The foregoing is a more detailed description of embodiments of the present invention, and the present invention is not to be considered limited to such descriptions. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims

1. A compressed packet security detection method comprises the following steps:

reading a compressed packet, and decompressing the compressed packet to obtain each file;

identifying each file based on an AI file type identification model File-RNN, and determining a target file type to which each file belongs;

when the type of a target file to which at least one file belongs is different from the suffix name of the file, a warning is output.

2. The method for detecting the security of the compressed packet according to claim 1, wherein before the File type identification model File-RNN based on AI identifies the files, the method comprises:

respectively converting each file into binary stream, and filtering invalid information in each file;

and extracting the characteristic information of each file from the filtered binary stream according to a preset text characteristic structure.

3. The method for detecting the security of the compressed packet according to claim 2, wherein the preset text feature structure comprises a file code and a unit length content code;

the extracting the feature information of each file from the filtered binary stream according to a preset text feature structure comprises:

dividing the unit length content code into N sections, wherein N is determined by a preset file definition type of each file, and N is a positive integer;

encoding N sections of unit length content as the characteristic information of the file;

and converting the characteristic information into a vector object.

4. The method for detecting the security of the compressed packet according to claim 3, wherein the File type identification model File-RNN based on AI identifies the files and determines the target File types to which the files belong, and comprises:

the File-RNN model performs inference on the vector object, and determines the file types of the file corresponding to the vector object and the probability value corresponding to each file type;

and taking the file type corresponding to the maximum probability value as the target file type to which the file belongs.

5. The method for detecting the security of the compressed packet according to any one of claims 1 to 4, wherein before reading the compressed packet and decompressing the compressed packet to obtain each file, the method comprises:

and carrying out model training on File samples of known files of different types according to a Recurrent Neural Network (RNN) in the AI, and determining the File-RNN model.

6. The method for security inspection of compressed packets according to claim 5, wherein the model training of the file samples of different types of known files according to the RNN in AI comprises:

collecting different types of known files;

randomly slicing the known file of each type to obtain a file sample;

extracting feature information of the file sample according to a preset file feature structure to obtain a vector object;

and executing an RNN inference model on the vector object according to AI model training, and determining the File-RNN model.

7. The method for detecting the security of the compressed packet according to claim 6, wherein randomly slicing the known file to obtain a file sample comprises:

setting the size and the number of the slices respectively aiming at the known files of various types;

and randomly cutting known files of various types according to the size and the number of the slices, and taking the divided files as the file samples.

8. The compressed packet security detection method of claim 7, wherein the randomly cutting each type of known file according to the size and number of the slices comprises:

when the size of the known file is larger than a preset threshold value, filtering a file header of the known file;

and randomly selecting the starting point of the filtered known file, and cutting according to the slice size until the set number of slices is reached.

9. A security detection device, comprising a decompression module, a File-RNN model identification module and a detection module;

the decompression module is used for reading the compressed packet and decompressing the compressed packet to obtain each file;

the File-RNN model identification module is used for identifying each File based on an AI File type identification model File-RNN and determining the target File type of each File;

the detection module is used for outputting a warning when the type of the target file to which at least one file belongs is different from the suffix name of the file.

10. A terminal comprising a processor, a memory, and a communication bus;

the communication bus is used for realizing connection communication between the processor and the memory;

the processor is configured to execute one or more programs stored in the memory to implement the steps of the compressed packet security detection method according to any one of claims 1 to 8.

11. A storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the compressed packet security detection method according to any one of claims 1 to 8.