WO2019201295A1 - File identification method and feature extraction method - Google Patents

File identification method and feature extraction method

Info

Publication number
WO2019201295A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
image data
identified
transfer matrix
feature
Prior art date
Application number
PCT/CN2019/083200
Other languages
English (en)
French (fr)
Inventor
顾成杰
Original Assignee
新华三信息安全技术有限公司
Priority date
Filing date
Publication date
Application filed by 新华三信息安全技术有限公司
Publication of WO2019201295A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/55: Detecting local intrusion or implementing counter-measures
    • G06F 21/56: Computer malware detection or handling, e.g. anti-virus arrangements

Definitions

  • Malicious code is one form of attack used by attackers.
  • A file carrying malicious code is a malicious file; that is, malicious files are one form in which an attacker carries out an attack.
  • Malicious files exploit network service vulnerabilities to attack network servers for the purpose of stealing information and services.
  • In the related art, the process of file identification includes: obtaining a file to be identified, running it in a sandbox, extracting its runtime features, normalizing the extracted features, and inputting the normalized features into a deep neural network (English: Deep Neural Network, abbreviated: DNN) model to obtain an identification result.
  • The DNN model is trained using the runtime features of sample files.
  • FIG. 1 is a schematic diagram of a first process of a file identification method according to an embodiment of the present application
  • FIG. 2 is a first schematic diagram of a transfer matrix provided by an embodiment of the present application.
  • FIG. 3 is a second schematic diagram of a transfer matrix provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of image data based on the transfer matrix shown in FIG. 3;
  • FIG. 5 is a schematic structural diagram of a convolutional neural network model according to an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a model training method according to an embodiment of the present application.
  • FIG. 7 is a second schematic flowchart of a file identification method according to an embodiment of the present disclosure.
  • FIG. 8 is a schematic flowchart of a feature extraction method according to an embodiment of the present disclosure.
  • FIG. 9 is a schematic diagram of a first structure of a file identification apparatus according to an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a second structure of a file identification apparatus according to an embodiment of the present disclosure.
  • FIG. 11 is a schematic structural diagram of a feature extraction apparatus according to an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a first structure of a network device according to an embodiment of the present disclosure.
  • FIG. 13 is a schematic diagram of a second structure of a network device according to an embodiment of the present disclosure.
  • FIG. 14 is a schematic diagram of a third structure of a network device according to an embodiment of the present disclosure.
  • In the related art, the runtime features of the file to be identified in the sandbox are set by the user according to experience; that is, file identification depends on subjective human factors, so its accuracy is low.
  • the embodiment of the present application provides a file identification method.
  • the file identification method can be applied to network devices such as firewall devices, routers, switches, and the like.
  • the method can also be performed by a file identification device, which can be implemented in hardware and/or software, and can generally be integrated into a network device for file identification.
  • the file identification method provided by the embodiment of the present application converts the file to be identified into image data, extracts features of the image data, and then determines whether the file to be identified is a malicious file according to the extracted feature.
  • The features of the image data objectively exist in the file to be identified rather than being set according to experience. Obtaining the file identification result from these objective features reduces the dependence of file identification on subjective human factors and improves its accuracy. Therefore, the file identification method provided by the embodiment of the present application is more accurate.
  • FIG. 1 is a schematic diagram of a first process of a file identification method according to an embodiment of the present application, where the method includes the following process.
  • The following description takes a network device as the execution subject of the file identification method as an example.
  • the file to be identified obtained by the network device may be: a file sent by another network device to the network device.
  • the file to be identified obtained by the network device may also be: a file obtained from a locally stored file.
  • Section 102: determine a plurality of character strings corresponding to the file to be identified according to a preset reading rule and a preset phrase model.
  • Specifically, this may include: reading the file to be identified according to the preset reading rule to obtain a plurality of characters, and combining adjacent characters of the plurality of characters according to the preset phrase model to obtain a plurality of character strings.
  • the reading rule may include: binary, octal, or hexadecimal, but is not limited to these types of reading rules.
  • The preset phrase model may include a bigram (English: BiGram) model and/or a trigram (English: TriGram) model.
  • a transfer matrix is constructed according to a plurality of strings corresponding to the file to be identified. Among them, the elements in the transfer matrix correspond one-to-one with the type of the string.
  • The type of a character string refers to which distinct string it is; the string types that can be obtained differ depending on the reading rule and/or the phrase model used.
  • constructing the transfer matrix according to the plurality of character strings corresponding to the file to be identified may include: determining the number of occurrences of each character string in the plurality of character strings, according to the number of occurrences of each character string Construct a transfer matrix.
  • The transfer matrix is square: its number of rows equals its number of columns, and both equal the ratio of the number of string types to the number of character types.
  • The number of string types is the number of distinct strings obtainable when character strings are determined according to the preset reading rule and the preset phrase model; the number of character types is the number of distinct characters obtainable when the file is read according to the preset reading rule.
  • the preset reading rule is hexadecimal
  • the preset phrase model includes the BiGram model and the TriGram model.
  • In one implementation, constructing the transfer matrix according to the number of occurrences of each character string may include: for each character string, taking its number of occurrences as the value of the element corresponding to that string in the transfer matrix, thereby obtaining the transfer matrix.
  • In the following example, the BiGram model is taken as the preset phrase model.
  • the network device obtains the file f1 to be identified, reads the file f1 to be identified according to a preset reading rule, and obtains a plurality of characters: abcbbcdabcd.
  • The adjacent characters of the plurality of characters corresponding to the file f1 to be identified are combined, and the obtained plurality of character strings are: ab, bc, cb, bb, bc, cd, da, ab, bc, cd.
  • The number of occurrences of each character string is: "ab" appears 2 times, "bc" appears 3 times, "cb" appears 1 time, "bb" appears 1 time, "cd" appears 2 times, and "da" appears 1 time. The number of occurrences of every other string is 0.
  • Each square in FIG. 2 represents an element of the matrix; the horizontal character and the vertical character corresponding to a square form the string corresponding to that square.
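The bigram counting and transfer-matrix construction walked through in the example above can be sketched in Python. The helper names are illustrative, not from the source:

```python
from collections import Counter

def bigram_counts(chars: str) -> Counter:
    """Combine adjacent characters into bigrams and count occurrences."""
    return Counter(chars[i:i + 2] for i in range(len(chars) - 1))

def build_transfer_matrix(chars: str, alphabet: str):
    """Element [row][col] holds the occurrence count of the string
    alphabet[row] + alphabet[col], matching the FIG. 2 layout."""
    counts = bigram_counts(chars)
    return [[counts[r + c] for c in alphabet] for r in alphabet]

counts = bigram_counts("abcbbcdabcd")
print(counts)  # ab: 2, bc: 3, cb: 1, bb: 1, cd: 2, da: 1

matrix = build_transfer_matrix("abcbbcdabcd", "abcd")
print(matrix)
```

Running this on the file f1 from the example reproduces the counts stated in the text, with every other element of the 4x4 matrix equal to 0.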
  • In another implementation, constructing the transfer matrix according to the number of occurrences of each character string may include: for each character string, calculating the sum of its number of occurrences and a preset initial value, and using the calculated sum as the value of the element corresponding to that string in the transfer matrix, thereby obtaining the transfer matrix.
  • Each square in FIG. 3 represents an element of the matrix; the horizontal character and the vertical character corresponding to a square form the string corresponding to that square.
  • Section 104: determine target image data corresponding to the file to be identified according to the elements in the transfer matrix.
  • one element in the transfer matrix corresponds to one image cell
  • Determining the target image data corresponding to the file to be identified means converting the value of each element in the transfer matrix into image data.
  • the color depth of the image cell corresponding to each element in the transfer matrix is calculated according to the value of each element in the transfer matrix, and the target image data corresponding to the file to be identified is obtained.
  • the above image cell is the smallest unit of image processing.
  • The color depth is the gray value of a point in a black-and-white image; in the embodiment of the present application, the color depth is taken as the value of the image cell.
  • Typically, the color depth ranges from 0 to 255, where white is 255 and black is 0.
  • However, the range of the color depth is not limited in the embodiment of the present application; that is, the color depth may be an integer or a decimal, positive or negative.
  • the color depth of the image cell corresponding to each element may be determined in the following manner.
  • For a first element in the transfer matrix, the value of the first element is determined as a first value; the first element is any element in the transfer matrix, and its value is determined according to the number of occurrences of a first character string.
  • The first character string is the string corresponding to the first element in the transfer matrix.
  • The sum of the values of all second elements is determined as a second value.
  • The value of a second element is determined according to the number of occurrences of a second character string, where the head character of the second character string is the same as the head character of the first character string; the second character strings include the first character string itself.
  • The head character is the first character of a string.
  • According to the ratio of the first value to the second value, the color depth of the image cell corresponding to the first element in the transfer matrix is determined.
  • In one implementation, the calculated ratio (i.e., the ratio of the first value to the second value) may be used directly as the color depth of the image cell corresponding to the first element in the transfer matrix. This is done for each element in the transfer matrix.
  • In another implementation, the transition probability h of the first element may be determined by a formula in terms of T, the calculated ratio (i.e., the ratio of the first value to the second value).
  • The calculated transition probability of the first element is then determined as the color depth of the image cell corresponding to the first element.
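The row-normalization step above can be sketched as follows. The exact formula relating h to T is not reproduced in this excerpt, so the sketch simply maps the ratio into the conventional 0-255 color-depth range; the 255 scaling is an illustrative assumption:

```python
def color_depths(matrix, scale=255):
    """For each element, the first value is the element itself and the
    second value is the sum of its row (all elements whose strings share
    the same head character). The ratio is mapped into the 0-255
    color-depth range; the scaling factor is an assumption."""
    image = []
    for row in matrix:
        second_value = sum(row)
        if second_value == 0:
            image.append([0] * len(row))  # row with no occurrences: render as black
        else:
            image.append([round(v / second_value * scale) for v in row])
    return image

matrix = [[0, 2, 0, 0],
          [0, 1, 3, 0],
          [0, 1, 0, 2],
          [1, 0, 0, 0]]
print(color_depths(matrix))
```

Each row of the result sums to roughly 255, since the ratios within a row (a fixed head character) sum to 1.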
  • section 105 extracting features of the target image data, and determining whether the file to be identified is a malicious file according to characteristics of the target image data.
  • A feature of the target image data may be extracted using a convolutional neural network (English: Convolutional Neural Network, abbreviated: CNN) model.
  • The CNN model adopted in the embodiment of the present application may be an improvement based on the structure of the classic CNN model LeNet-5.
  • Lenet-5 is a classic CNN network architecture, including 3 convolutional layers, 2 pooling layers and 2 fully connected layers.
  • The improved LeNet-5 structure is shown in FIG. 5.
  • the first convolutional layer includes 32 convolution kernels
  • the second convolutional layer includes 64 convolution kernels.
  • A DropOut (English: DropOut) layer with a rate of 0.25 is added after the second pooling layer, and a DropOut layer with a rate of 0.5 is added after the first fully connected layer.
  • The DropOut layer can also be called the discard layer.
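The modified LeNet-5 structure described above can be summarized as a layer list. Only the kernel counts and DropOut rates come from the text; the layer ordering follows the standard LeNet-5 pattern (3 convolutional, 2 pooling, 2 fully connected layers) and is otherwise an assumption:

```python
# Sketch of the improved LeNet-5 of FIG. 5 as a declarative layer list.
MODIFIED_LENET5 = [
    ("conv", {"kernels": 32}),    # first convolutional layer: 32 kernels
    ("pool", {}),                 # first pooling layer
    ("conv", {"kernels": 64}),    # second convolutional layer: 64 kernels
    ("pool", {}),                 # second pooling layer
    ("dropout", {"rate": 0.25}),  # DropOut added after the second pooling layer
    ("conv", {}),                 # third convolutional layer
    ("fc", {}),                   # first fully connected layer
    ("dropout", {"rate": 0.5}),   # DropOut added after the first fully connected layer
    ("fc", {}),                   # second fully connected layer (output)
]

layer_kinds = [kind for kind, _ in MODIFIED_LENET5]
print(layer_kinds)
```

A structure like this could be translated layer by layer into any deep-learning framework.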
  • The feature of the target image data may be identified using the DNN model; that is, the DNN model uses the feature of the target image data to determine whether the file to be identified is a malicious file.
  • the feature of the target image data is input into the pre-trained DNN model to obtain an output result, wherein the output result indicates whether the file to be identified is a malicious file.
  • the output result indicates that the file to be identified is a malicious file, or the output result indicates that the file to be identified is a non-malicious file.
  • a non-malicious file is a secure file.
  • Specifically, the feature of the target image data is input into the DNN model, obtaining a first probability that the file to be identified is a secure file and a second probability that it is a malicious file. If the first probability is greater than the second probability, the output of the DNN model indicates that the file to be identified is a secure file; otherwise, the output indicates that it is a malicious file.
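The two-probability decision rule can be sketched as follows. A softmax output layer is an assumption for illustration; the source states only that the DNN yields the two probabilities:

```python
import math

def softmax2(logit_secure, logit_malicious):
    """Softmax over two output values, yielding the first probability
    (secure) and the second probability (malicious)."""
    e_s, e_m = math.exp(logit_secure), math.exp(logit_malicious)
    total = e_s + e_m
    return e_s / total, e_m / total

def classify(logit_secure, logit_malicious):
    first_prob, second_prob = softmax2(logit_secure, logit_malicious)
    # If the first probability is greater than the second, the file is secure.
    return "secure" if first_prob > second_prob else "malicious"

print(classify(2.0, -1.0))   # secure
print(classify(-0.5, 1.5))   # malicious
```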
  • the feature of the image data is used to determine whether the file to be identified is a malicious file.
  • The features of the image data objectively exist in the file to be identified rather than being set according to experience. Obtaining the identification result from these objective features reduces the dependence of file identification on subjective human factors and improves its accuracy.
  • the DNN model and the CNN model may be pre-trained before the identification of the file to be recognized.
  • The model training method is shown in FIG. 6 and includes the following process.
  • The initialized parameter set may be represented by θ_i; the preset CNN model and the preset DNN model each have their own initialized parameter set.
  • The initialization parameters can be set according to actual needs and experience; i is the cumulative number of forward calculations performed so far.
  • Training-related hyperparameters, such as the learning rate, the gradient descent algorithm, and the back-propagation algorithm, are also set.
  • These hyperparameters may be set in any of the various manners in the related art and are not described in detail herein.
  • the preset training set includes a sample file and a label of the sample file, and the label may include: a first label for indicating that the file is a malicious file and a second label for indicating that the file is a non-malicious file.
  • the sample file can be a binary file.
  • the sample file included in the preset training set may be obtained from the network through a web crawler or the like, or may be obtained from a pre-acquired sample file library, which is not limited by the embodiment of the present application.
  • the order of execution of the 601, 602, and 603 portions is not limited in the embodiment of the present application.
  • Section 604 Convert each sample file in the preset training set to image data.
  • Section 605 Perform a forward calculation as follows.
  • the image data of each sample file obtained in Section 604 is input to a preset CNN model to obtain features of the image data corresponding to the sample file.
  • the feature outputted by the preset CNN model is input into a preset DNN model to obtain an output result corresponding to the sample file.
  • the output indicates that the sample file is a secure file or indicates that the sample file is a malicious file.
  • Specifically, a third probability that the sample file is a secure file and a fourth probability that the sample file is a malicious file are obtained. If the third probability is greater than the fourth probability, the output result corresponding to the sample file indicates that it is a secure file; otherwise, the output result indicates that it is a malicious file.
  • the current parameter set is ⁇ 1
  • the current parameter set ⁇ i is obtained by adjusting the parameter set ⁇ i-1 used last time, and the current parameter set is obtained.
  • the last used parameter set For the adjustment, please refer to the following description.
  • the loss value is calculated based on the label of each sample file and the output corresponding to the preset DNN model.
  • The mean squared error (English: Mean Squared Error, abbreviated: MSE) formula can be used as the loss function to obtain the loss value L(θ_i), as shown in the following formula:
  • L(θ_i) = (1/H) · Σ_{j=1}^{H} ( f(I_j; θ_i) − X_j )²
  • where H represents the number of sample files selected from the preset training set in a single training pass, I_j represents the feature of the image data corresponding to the j-th sample file, f(I_j; θ_i) represents the output of the forward calculation of the DNN model under the parameter set θ_i for the j-th sample file, X_j represents the label of the j-th sample file, and i is the cumulative number of forward calculations.
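The MSE loss above can be computed directly. The forward-calculation outputs and 0/1 labels below are hypothetical values for illustration:

```python
def mse_loss(outputs, labels):
    """L(theta_i) = (1/H) * sum over j of (f(I_j; theta_i) - X_j)**2,
    where outputs[j] is the forward-calculation result f(I_j; theta_i)
    and labels[j] is the label X_j of the j-th sample file."""
    h = len(outputs)
    return sum((f_j - x_j) ** 2 for f_j, x_j in zip(outputs, labels)) / h

# Hypothetical forward outputs for H = 4 sample files with 0/1 labels.
print(mse_loss([0.9, 0.2, 0.8, 0.1], [1, 0, 1, 0]))
```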
  • the preset model includes a CNN model and a preset DNN model.
  • Convergence may be determined when the loss value is less than a preset loss threshold.
  • Alternatively, convergence may be determined when the difference between the current loss value and the previously calculated loss value is less than a preset change threshold. The convergence criterion is not limited here.
  • Section 608: if the loss has not converged, adjust the parameters in the current parameter set θ_i to obtain an adjusted parameter set, and then return to Section 605 for the next forward calculation.
  • the back propagation algorithm can be used to adjust the parameters in the current parameter set.
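A minimal sketch of the parameter adjustment, assuming plain gradient descent on a single scalar parameter (real back-propagation updates the whole parameter set the same way):

```python
def adjust(theta, grad, learning_rate=0.1):
    """One gradient-descent update: theta_i = theta_{i-1} - lr * dL/dtheta."""
    return theta - learning_rate * grad

# Toy loss L(theta) = (theta - 3)**2 with gradient 2 * (theta - 3):
theta = 0.0
for _ in range(100):
    theta = adjust(theta, 2 * (theta - 3))
print(round(theta, 6))  # converges toward 3.0
```

Iterating the adjust/forward-calculate loop until the loss converges is exactly the 605-608 cycle described above.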
  • the current parameter set ⁇ i is taken as the final parameter set of the output ⁇ final
  • the current parameter set The final parameter set as output
  • the preset DNN model of the final parameter set ⁇ final will be used as the trained DNN model.
  • Final parameter set The preset CNN model is used as a trained CNN model.
  • The training of the above CNN model and DNN model can be implemented on the same network device that performs file identification.
  • Alternatively, the network device that trains the CNN model and the DNN model may be different from the network device that performs file identification.
  • the feature of the target image data may be identified by using a malicious file feature library to determine whether the file to be identified is a malicious file.
  • The malicious file feature library includes features of the image data corresponding to a plurality of sample malicious files. Specifically, the target image data is input into the CNN model, and the output of a preset layer of the CNN model is acquired as the feature of the target image data. The feature of the target image data is then searched for in the preset malicious file feature library: if it is found, the file to be identified is determined to be a malicious file; if it is not found, the file to be identified is determined to be a secure file.
  • When building the library, the image data corresponding to each sample malicious file can be input into the CNN model, and the output of the preset layer of the CNN model is taken as the feature of the image data corresponding to that sample malicious file.
  • a malicious file signature database is constructed from the characteristics of the image data corresponding to the plurality of sample malicious files.
  • The preset layer may be the third convolutional layer of the CNN model, as shown in FIG. 5.
  • the feature length of the third convolutional layer output is 512 bytes.
  • Because the features in the malicious file feature library are extracted directly from malicious files, if the feature of the file to be identified matches a feature in the library, the file to be identified can be determined to be a malicious file, which improves the accuracy of file identification. In addition, the computation required to match a feature against the malicious file feature library is much smaller than that required for identification with the DNN model, which improves the efficiency of file identification.
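The library lookup can be sketched as follows. Exact-match lookup on hashable feature vectors is an illustrative assumption; the source says only that the target feature is searched for in the library:

```python
def build_feature_library(sample_features):
    """Store the preset-layer features of known malicious samples in a
    set for fast membership tests."""
    return {tuple(f) for f in sample_features}

def identify(feature, library):
    """Found in the library -> malicious; not found -> secure."""
    return "malicious" if tuple(feature) in library else "secure"

library = build_feature_library([[17, 5, 200], [3, 3, 3]])
print(identify([17, 5, 200], library))  # found -> malicious
print(identify([0, 0, 0], library))     # not found -> secure
```

Set membership is O(1) on average, which reflects the text's point that matching against the library is far cheaper than a full DNN forward pass.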
  • FIG. 7 is a second schematic flowchart of a file identification method according to an embodiment of the present application, including the following process.
  • the execution subject of the file identification method is a network device as an example.
  • the file to be identified obtained by the network device may be: a file sent by another network device to the network device.
  • the file to be identified obtained by the network device may also be: a file obtained from a locally stored file.
  • the file to be identified is input into a pre-trained file recognition model to determine whether the file to be identified is a malicious file.
  • The file recognition model is configured to: determine a plurality of character strings corresponding to the input file according to a preset reading rule and a preset phrase model; construct a transfer matrix according to the plurality of character strings, where the elements in the transfer matrix correspond one-to-one with the string types; determine target image data corresponding to the input file according to the elements in the transfer matrix; and extract features of the target image data and determine whether the input file is a malicious file according to those features.
  • The input file is the file input into the file recognition model.
  • In this embodiment, the input file is the file to be identified.
  • The type of a character string refers to which distinct string it is; the string types that can be obtained differ depending on the reading rule and/or the phrase model used.
  • Determining a plurality of character strings corresponding to the input file according to the preset reading rule and the preset phrase model may include: reading the input file according to the preset reading rule to obtain a plurality of characters, and combining adjacent characters of the plurality of characters according to the preset phrase model to obtain a plurality of character strings.
  • the reading rule may include: binary, octal, or hexadecimal, but is not limited to these types of reading rules.
  • the preset phrase model may include a BiGram model and/or a TriGram model.
  • constructing the transfer matrix according to the plurality of character strings corresponding to the input file may include: determining the number of occurrences of each character string in the plurality of character strings, according to the number of occurrences of each character string, Construct a transfer matrix.
  • The transfer matrix is square: its number of rows equals its number of columns, and both equal the ratio of the number of string types to the number of character types.
  • The number of string types is the number of distinct strings obtainable when character strings are determined according to the preset reading rule and the preset phrase model; the number of character types is the number of distinct characters obtainable when the file is read according to the preset reading rule.
  • the preset reading rule is hexadecimal
  • the preset phrase model may include a BiGram model and a TriGram model.
  • Since the number of rows and the number of columns of the transfer matrix are the same and the elements in the transfer matrix correspond one-to-one with the string types, the number of rows and columns of the transfer matrix may be 272: reading in hexadecimal yields 16 character types, the BiGram and TriGram models together yield 16² + 16³ = 4352 string types, and 4352 / 16 = 272. That is, a 272*272 transfer matrix can be constructed according to the number of occurrences of each character string corresponding to the input file.
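The 272-row arithmetic follows directly from the stated ratio rule and can be checked in a few lines:

```python
def transfer_matrix_order(num_char_types, ngram_lengths):
    """Rows (= columns) of the transfer matrix: the ratio of the number
    of string types to the number of character types."""
    string_types = sum(num_char_types ** n for n in ngram_lengths)
    return string_types // num_char_types

# Hexadecimal reading (16 character types) with BiGram + TriGram models:
# 16**2 + 16**3 = 4352 string types, and 4352 // 16 = 272.
print(transfer_matrix_order(16, (2, 3)))  # 272
```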
  • In one implementation, constructing the transfer matrix according to the number of occurrences of each character string may include: for each character string, taking its number of occurrences as the value of the element corresponding to that string in the transfer matrix, thereby obtaining the transfer matrix.
  • In another implementation, constructing the transfer matrix according to the number of occurrences of each character string may include: for each character string, calculating the sum of its number of occurrences and a preset initial value, and using the calculated sum as the value of the element corresponding to that string in the transfer matrix, thereby obtaining the transfer matrix.
  • one element in the transfer matrix corresponds to one image cell
  • Determining the target image data corresponding to the input file means converting the value of each element in the transfer matrix into image data.
  • determining the target image data corresponding to the input file according to the elements in the transfer matrix may include: calculating a color depth of the image cell corresponding to each element in the transfer matrix according to the value of each element in the transfer matrix, and obtaining an input file corresponding to the input file Target image data.
  • the above image cell is the smallest unit of image processing.
  • the color depth is the gray value of the point in the black and white image. In the embodiment of the present application, the color depth is taken as the value of the image cell.
  • the color depth of the image cell corresponding to each element may be determined in the following manner. Specifically, calculating the color depth of the image cell corresponding to each element in the transfer matrix according to the value of each element in the transfer matrix may include: determining, for the first element in the transfer matrix, a value of the first element as the first value.
  • the first element is any element in the transfer matrix, and the value of the first element is determined according to the number of occurrences of the first character string.
  • the first string is a string corresponding to the first element in the transfer matrix.
  • the sum of the values of all the second elements is determined to be the second value.
  • The value of a second element is determined according to the number of occurrences of a second character string, where the head character of the second character string is the same as the head character of the first character string.
  • The head character is the first character of a string.
  • the color depth of the image cell corresponding to the first element in the transfer matrix is determined.
  • the calculated ratio may be used as the color depth of the image cell corresponding to the first element in the transfer matrix.
  • Alternatively, the transition probability h of the first element may be determined by a formula in terms of T, the calculated ratio (i.e., the ratio of the first value to the second value).
  • the calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
  • extracting features of the target image data may include: inputting the target image data into the pre-trained CNN model to obtain features of the target image data.
  • The adopted CNN model may be an improvement based on the structure of the classic CNN model LeNet-5.
  • Lenet-5 is a classic CNN network architecture, including 3 convolutional layers, 2 pooling layers and 2 fully connected layers.
  • The improved LeNet-5 structure is shown in FIG. 5.
  • the first convolutional layer includes 32 convolution kernels
  • the second convolutional layer includes 64 convolution kernels.
  • The feature of the target image data may be identified using the DNN model; that is, the DNN model uses the feature of the target image data to determine whether the input file is a malicious file.
  • determining whether the input file is a malicious file according to the feature of the target image data may include: inputting the feature of the target image data into the pre-trained DNN model to obtain an output result; wherein the DNN model is used to perform the feature of the image data. Identifying whether the file corresponding to the image data is a malicious file, and the output result indicates whether the input file is a malicious file.
  • inputting features of the target image data into the DNN model yields a first probability that the input file is a secure file and a second probability that the input file is a malicious file. If the first probability is greater than the second probability, the output of the DNN model indicates that the input file is a secure file. Otherwise, the output of the DNN model indicates that the input file is a malicious file.
  • the DNN model and the CNN model may be pre-trained before the identification of the file to be recognized.
  • the training process of the DNN model and the CNN model can be described with reference to the description of sections 601-609 of the embodiment shown in FIG. 6.
  • the feature of the target image data may be identified by using a malicious file feature library to determine whether the file to be identified is a malicious file.
  • The malicious file feature library includes features of the image data corresponding to a plurality of sample malicious files. Specifically, the target image data is input into the CNN model, and the output of a preset layer of the CNN model is acquired as the feature of the target image data. The feature of the target image data is then searched for in the preset malicious file feature library: if it is found, the input file is determined to be a malicious file; if it is not found, the input file is determined to be a secure file.
  • for each sample malicious file, the image data corresponding to the sample malicious file can be input into the CNN model, and the output result of the preset layer of the CNN model is taken as the feature of the image data corresponding to that sample malicious file.
  • a malicious file feature library is then constructed from the features of the image data corresponding to the plurality of sample malicious files.
  • the preset layer may be the third convolutional layer of the CNN model, as shown in Figure 4.
  • the feature length of the third convolutional layer output is 512 bytes.
  • since the features in the malicious file feature library are directly extracted from malicious files, if the feature of the file to be identified matches a feature in the malicious file feature library, the file to be identified can be determined to be a malicious file, which improves the accuracy of file identification. In addition, compared with the amount of calculation required for the DNN model to identify a feature, matching a feature against the malicious file feature library requires far less calculation, which improves the efficiency of file identification.
  • when file identification is performed, the file to be identified is converted into image data, the features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
  • the features of the image data derive from the objective content of the file to be identified rather than being set according to experience; obtaining the file recognition result from such objective features reduces the dependence of file recognition on subjective human factors and improves the accuracy of file recognition.
  • FIG. 8 is a schematic flowchart of a feature extraction method according to an embodiment of the present application. The method includes the following process.
  • Section 801 Multiple sample files are entered into the file recognition model.
  • the file identification model is configured to: determine a plurality of character strings corresponding to the input file according to the preset reading rule and the preset phrase model; construct a transfer matrix according to the plurality of character strings corresponding to the input file, wherein the elements in the transfer matrix correspond one-to-one with the types of the character strings; determine the target image data corresponding to the input file according to the elements in the transfer matrix; extract the feature of the target image data by using the CNN model, and identify the feature of the target image data by using the DNN model.
  • the type of a character string refers to its category; the types of character strings obtained differ according to different reading rules and/or phrase models.
  • the input file is a file that is input into the file recognition model.
  • the multiple sample files are input files.
  • the DNN model and the CNN model are trained before the feature is extracted.
  • the training process of the DNN model and the CNN model can be described with reference to the description of sections 601-609 of the embodiment shown in FIG. 6.
  • Section 802 For each sample file, the output of the preset layer of the CNN model is extracted as a feature of the sample file.
  • determining a plurality of character strings corresponding to the input file according to the preset reading rule and the preset phrase model may include: reading the input file according to the preset reading rule to obtain a plurality of characters; and combining adjacent characters of the plurality of characters according to the preset phrase model to obtain a plurality of character strings.
  • the reading rule may include: binary, octal, or hexadecimal, but is not limited to these types of reading rules.
  • the preset phrase model can include a BiGram model and/or a TriGram model.
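The reading rule and phrase models can be sketched as follows, assuming the hexadecimal reading rule together with the BiGram and TriGram models (function names are illustrative):

```python
def read_hex_chars(data: bytes):
    """Hexadecimal reading rule: each byte of the file yields
    two hexadecimal characters (nibbles)."""
    return list(data.hex())

def ngrams(chars, n):
    """Combine adjacent characters into n-character strings:
    BiGram for n=2, TriGram for n=3."""
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

chars = read_hex_chars(b"\x4d\x5a")  # hypothetical two-byte file
print(chars)             # ['4', 'd', '5', 'a']
print(ngrams(chars, 2))  # ['4d', 'd5', '5a']
print(ngrams(chars, 3))  # ['4d5', 'd5a']
```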
  • constructing the transfer matrix according to the plurality of character strings corresponding to the input file may include: determining the number of occurrences of each character string in the plurality of character strings; according to the number of occurrences of each character string, Construct a transfer matrix.
  • the number of rows and the number of columns of the transfer matrix are the same, and the number of rows and columns of the transfer matrix are: the ratio of the number of string types to the number of character types.
  • the number of the type of the string is: the number of types of the string that can be obtained when the character string is determined according to the preset reading rule and the preset phrase model; the number of character types is: when the file is read according to the preset reading rule, The number of types of characters that can be obtained.
  • for example, the preset reading rule may be hexadecimal and the preset phrase model may include both the BiGram model and the TriGram model. In this case there are 16 character types and 16² + 16³ = 4352 string types; since the number of rows and the number of columns of the transfer matrix are the same and the elements in the transfer matrix correspond one-to-one with the types of the character strings, the number of rows and the number of columns of the transfer matrix may each be 4352 / 16 = 272. That is, a 272*272 transfer matrix can be constructed according to the number of occurrences of each character string corresponding to the input file.
  • constructing the transfer matrix according to the number of occurrences of each character string may include: for each character string, using the number of occurrences of the character string as the value of the element corresponding to the character string in the transfer matrix, to obtain the transfer matrix.
  • alternatively, constructing the transfer matrix according to the number of occurrences of each character string may include: calculating, for each character string, the sum of the number of occurrences of the character string and a preset initial value, and using the calculated sum as the value of the element corresponding to the character string in the transfer matrix, to obtain the transfer matrix.
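Under the hexadecimal reading rule with both BiGram and TriGram models, the matrix dimension and the count-based fill (with an optional preset initial value) can be sketched as below; the mapping from string types to matrix positions is a hypothetical assignment, since the embodiment does not specify one:

```python
from collections import Counter

CHAR_TYPES = 16                   # hexadecimal reading rule
STRING_TYPES = 16**2 + 16**3      # BiGram + TriGram string types = 4352
DIM = STRING_TYPES // CHAR_TYPES  # rows = columns = 272

def build_transfer_matrix(strings, initial=0):
    """Fill a DIM x DIM matrix with occurrence counts, optionally offset
    by a preset initial value. The index assignment below (bigrams first,
    then trigrams, laid out row-major) is an illustrative assumption."""
    counts = Counter(strings)
    matrix = [[initial] * DIM for _ in range(DIM)]
    for s, n in counts.items():
        idx = int(s, 16) if len(s) == 2 else 16**2 + int(s, 16)
        matrix[idx // DIM][idx % DIM] = initial + n
    return matrix

# Hypothetical strings from a tiny input file; initial value 1.
m = build_transfer_matrix(["4d", "4d", "d5a"], initial=1)
print(DIM)  # 272
```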
  • one element in the transfer matrix corresponds to one image cell
  • the target image data corresponding to the input file is determined, that is, the value of each element in the transfer matrix is converted into image data.
  • determining the target image data corresponding to the input file according to the elements in the transfer matrix may include: calculating a color depth of the image cell corresponding to each element in the transfer matrix according to the value of each element in the transfer matrix, and obtaining an input file corresponding to the input file Target image data.
  • the above image cell is the smallest unit of image processing.
  • the color depth is the gray value of the point in the black and white image. In the embodiment of the present application, the color depth is taken as the value of the image cell.
  • the color depth of the image cell corresponding to each element may be determined in the following manner. Specifically, the color depth of the image cell corresponding to each element in the transfer matrix is calculated according to the value of each element in the transfer matrix, and may include:
  • the value of the first element is determined to be the first value.
  • the first element is any element in the transfer matrix, and the value of the first element is determined according to the number of occurrences of the first character string.
  • the first string is a string corresponding to the first element in the transfer matrix.
  • the sum of the values of all the second elements is determined to be the second value.
  • the value of a second element is determined according to the number of occurrences of the corresponding second character string, where the head word of each second character string is the same as the head word of the first character string.
  • the second character strings include the first character string.
  • the head word of a character string is its first character.
  • according to the ratio of the first value to the second value, the color depth of the image cell corresponding to the first element in the transfer matrix is determined.
  • the calculated ratio may be used as the color depth of the image cell corresponding to the first element in the transfer matrix.
  • the transition probability h of the first element may be determined as a function of T according to a preset formula, where h is the transition probability of the first element and T is the calculated ratio, that is, the ratio of the first value to the second value.
  • the calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
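The first-value/second-value ratio can be sketched as follows; per the passage above, the ratio itself may serve as the color depth, and any further scaling (for example to a 0-255 gray level) would be an additional assumption:

```python
from collections import Counter

def color_depth(target, strings):
    """Ratio of the occurrences of the target string (first value) to the
    total occurrences of all strings sharing its head word, i.e. the same
    first character (second value). The ratio is used as the color depth
    of the image cell corresponding to the target string's element."""
    counts = Counter(strings)
    first_value = counts[target]
    second_value = sum(n for s, n in counts.items() if s[0] == target[0])
    return first_value / second_value

# Hypothetical strings: head word '4' covers "4d" (x2) and "4a" (x1).
strings = ["4d", "4d", "4a", "d5"]
print(color_depth("4d", strings))  # 2 / 3
```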
  • the adopted CNN model can be obtained by improving the classical CNN Lenet-5 structure.
  • Lenet-5 is a classic CNN network architecture, including 3 convolutional layers, 2 pooling layers and 2 fully connected layers.
  • the improvement of the Lenet-5 structure is shown in FIG.
  • the first convolutional layer includes 32 convolution kernels
  • the second convolutional layer includes 64 convolution kernels.
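The effect of the convolution and pooling layers on feature-map size can be sketched with the standard "valid" convolution formula; the kernel size (5×5), pooling window (2×2), and 272×272 input below are assumptions for illustration, as this excerpt does not state them:

```python
def conv_out(size, kernel, stride=1):
    """Spatial size after a 'valid' (no padding) convolution."""
    return (size - kernel) // stride + 1

def pool_out(size, window=2):
    """Spatial size after non-overlapping pooling."""
    return size // window

# Assumed 272x272 single-channel input (the transfer-matrix image).
s = 272
s = conv_out(s, 5)  # conv1 (32 kernels)  -> 268x268
s = pool_out(s)     # pool1               -> 134x134
s = conv_out(s, 5)  # conv2 (64 kernels)  -> 130x130
s = pool_out(s)     # pool2 (DropOut 0.25 after) -> 65x65
s = conv_out(s, 5)  # conv3 (preset feature layer) -> 61x61
print(s)
```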
  • the sample file is a sample malicious file.
  • the method may further include: constructing a malicious file feature library according to the extracted multiple features.
  • the preset layer may be the third convolutional layer of the CNN model.
  • the feature length of the third convolutional layer output is 512 bytes.
  • the malicious file feature library may be used to identify the file to be identified and determine whether the file to be identified is a malicious file.
  • the file to be identified is input into the file recognition model; the output result of the preset layer of the CNN model in the file recognition model is obtained as a target feature; and the target feature is searched from the malicious file feature database. If found, it is determined that the file to be identified is a malicious file. If not found, it is determined that the file to be identified is a security file.
  • the feature output by the preset layer of the CNN model in the pre-trained recognition model is extracted, so there is no need to manually analyze the file to extract features, which improves the efficiency of feature extraction and reduces labor cost.
  • a malicious file feature library is constructed from the extracted features of malicious files, and the file to be identified is identified based on the malicious file feature library. Since the features included in the malicious file feature library are directly extracted from malicious files, if the feature of the file to be identified matches a feature in the malicious file feature library, the file to be identified can be determined to be a malicious file, which improves the accuracy of file identification. In addition, compared with the amount of calculation required for the DNN model to identify a feature, matching a feature against the malicious file feature library requires far less calculation, which improves the efficiency of file identification.
  • FIG. 9 is a schematic diagram of a first structure of a file identification apparatus according to an embodiment of the present disclosure, where the apparatus includes:
  • the obtaining module 901 is configured to obtain a file to be identified
  • the first determining module 902 is configured to determine, according to the preset reading rule and the preset phrase model, a plurality of character strings corresponding to the file to be identified;
  • the construction module 903 is configured to construct a transfer matrix according to the plurality of character strings; wherein the elements in the transfer matrix have a one-to-one correspondence with the type of the string;
  • a second determining module 904 configured to determine, according to an element in the transfer matrix, target image data corresponding to the file to be identified;
  • the identification module 905 is configured to extract features of the target image data, and determine, according to characteristics of the target image data, whether the file to be identified is a malicious file.
  • the above-mentioned character string type refers to the category of a character string; the character string types obtained differ depending on the reading rule and/or phrase model used.
  • the first determining module 902 may be specifically configured to:
  • a plurality of character strings are obtained by combining adjacent characters of the plurality of characters.
  • the building module 903 is specifically configured to:
  • a transfer matrix is constructed based on the number of occurrences of each string.
  • the building module 903 is specifically configured to:
  • the number of occurrences of the string is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix
  • the sum of the number of occurrences of the string and the preset initial value is calculated, and the calculated sum value is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix.
  • the second determining module 904 is specifically configured to:
  • the color depth of the image cell corresponding to each element in the transfer matrix is calculated according to the value of each element in the transfer matrix, and the target image data corresponding to the file to be identified is obtained.
  • the second determining module 904 is specifically configured to:
  • the color depth of the image cell corresponding to the first element is determined according to the calculated ratio.
  • the first character string is a character string corresponding to the first element in the transfer matrix.
  • the second determining module 904 is specifically configured to:
  • the transition probability of the first element is determined according to the following formula:
  • the calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
  • the identification module 905 may be specifically configured to: input target image data into a pre-trained CNN model to obtain features of the target image data;
  • the CNN model is based on the classic CNN Lenet-5 model.
  • the first convolutional layer consists of 32 convolution kernels
  • the second convolutional layer consists of 64 convolution kernels
  • a DropOut layer with a rate of 0.25 is added after the second pooling layer
  • a DropOut layer with a rate of 0.5 is added after the first fully connected layer.
  • the identification module 905 is specifically configured to:
  • the feature of the target image data is input into the pre-trained DNN model to obtain an output result; wherein the DNN model is used to identify, from the feature of the image data, whether the file corresponding to the image data is a malicious file, and the output result indicates whether the file to be identified is a malicious file.
  • the feature of the target image data is an output result of a preset layer of the CNN model
  • the identification module 905 can be specifically used to:
  • the preset malicious file feature library includes: a feature of the image data corresponding to the plurality of sample malicious files;
  • if the feature is not found, the file to be identified is determined to be a security file.
  • the features of the image data corresponding to the plurality of sample malicious files may be obtained by, for each sample malicious file, inputting the image data corresponding to the sample malicious file into the CNN model and taking the output result of the preset layer of the CNN model as the feature of the corresponding image data.
  • when file identification is performed, the file to be identified is converted into image data, the features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
  • the features of the image data derive from the objective content of the file to be identified rather than being set according to experience; obtaining the file recognition result from such objective features reduces the dependence of file recognition on subjective human factors and improves the accuracy of file recognition.
  • FIG. 10 is a schematic diagram of a second structure of a file identification apparatus according to an embodiment of the present disclosure.
  • the apparatus includes: an obtaining module 1001, an input module 1002, and a file identification model, where the file identification model includes: a first determining module 1003, a building module 1004, a second determining module 1005 and an identifying module 1006;
  • the obtaining module 1001 is configured to obtain a file to be identified
  • the input module 1002 is configured to input the file to be identified into the pre-trained file recognition model
  • the first determining module 1003 is configured to determine, according to the preset reading rule and the preset phrase model, a plurality of character strings corresponding to the input file;
  • the construction module 1004 is configured to construct a transfer matrix according to the plurality of character strings corresponding to the input file; the elements in the transfer matrix are in one-to-one correspondence with the type of the string;
  • a second determining module 1005 configured to determine, according to an element in the transfer matrix, target image data corresponding to the input file
  • the identification module 1006 is configured to extract features of the target image data, and determine, according to characteristics of the target image data, whether the input file is a malicious file.
  • the above-mentioned character string type refers to the category of a character string; the character string types obtained differ depending on the reading rule and/or phrase model used.
  • the first determining module 1003 may be specifically configured to:
  • the input file is read according to a preset reading rule to obtain a plurality of characters; according to the preset phrase model, adjacent characters of the plurality of characters are combined to obtain a plurality of character strings.
  • the building module 1004 is specifically configured to:
  • the building module 1004 is specifically configured to:
  • the number of occurrences of the string is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix
  • the sum of the number of occurrences of the string and the preset initial value is calculated, and the calculated sum value is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix.
  • the second determining module 1005 may be specifically configured to:
  • the color depth of the image cell corresponding to each element in the transfer matrix is calculated according to the value of each element in the transfer matrix, and the target image data corresponding to the input file is obtained.
  • the second determining module 1005 may be specifically configured to:
  • the color depth of the image cell corresponding to the first element is determined according to the calculated ratio.
  • the first character string is a character string corresponding to the first element in the transfer matrix.
  • the second determining module 1005 may be specifically configured to:
  • the transition probability of the first element is determined according to the following formula:
  • the calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
  • the identification module 1006 may be specifically configured to: input target image data into a pre-trained CNN model to obtain features of the target image data;
  • the CNN model is based on the classic CNN Lenet-5 model.
  • the first convolutional layer consists of 32 convolution kernels
  • the second convolutional layer consists of 64 convolution kernels
  • a DropOut layer with a rate of 0.25 is added after the second pooling layer
  • a DropOut layer with a rate of 0.5 is added after the first fully connected layer.
  • the identification module 1006 may be specifically configured to:
  • the feature of the target image data is input into the pre-trained DNN model to obtain an output result; wherein the DNN model is used to identify, from the feature of the image data, whether the file corresponding to the image data is a malicious file, and the output result indicates whether the input file is a malicious file.
  • the feature of the target image data is an output result of a preset layer of the CNN model
  • the identification module 1006 can be specifically used to:
  • the preset malicious file feature library includes: a feature of the image data corresponding to the plurality of sample malicious files;
  • the features of the image data corresponding to the plurality of sample malicious files may be obtained by, for each sample malicious file, inputting the image data corresponding to the sample malicious file into the CNN model and taking the output result of the preset layer of the CNN model as the feature of the corresponding image data.
  • when file identification is performed, the file to be identified is converted into image data, the features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
  • the features of the image data derive from the objective content of the file to be identified rather than being set according to experience; obtaining the file recognition result from such objective features reduces the dependence of file recognition on subjective human factors and improves the accuracy of file recognition.
  • FIG. 11 is a schematic structural diagram of a feature extraction apparatus according to an embodiment of the present disclosure.
  • the device includes: an input module 1101, an extraction module 1102, and a file recognition model.
  • the file identification model includes: a first determining module 1103, a first building module 1104, a second determining module 1105, and a first identifying module 1106.
  • the input module 1101 is configured to input multiple sample files into the file recognition model respectively;
  • the first determining module 1103 is configured to determine, according to the preset reading rule and the preset phrase model, a plurality of character strings corresponding to the input file;
  • the first constructing module 1104 is configured to construct a transfer matrix according to the plurality of character strings corresponding to the input file; the elements in the transfer matrix are in one-to-one correspondence with the type of the character string;
  • a second determining module 1105 configured to determine, according to an element in the transfer matrix, target image data corresponding to the input file
  • the first identification module 1106 is configured to extract features of the input target image data by using the CNN model, and identify the features of the target image data by using the DNN model to determine whether the input file is a malicious file;
  • the extracting module 1102 is configured to extract, for each sample file, the output result of the preset layer of the CNN model as the feature of the sample file.
  • the above-mentioned character string type refers to the category of a character string; the character string types obtained differ depending on the reading rule and/or phrase model used.
  • identifying the features of the target image data by using the DNN model to determine whether the input file is a malicious file means that the DNN model uses the features of the image data to identify the input file and determine whether it is a malicious file.
  • the first determining module 1103 may be specifically configured to:
  • the input file is read according to a preset reading rule to obtain a plurality of characters; according to the preset phrase model, adjacent characters of the plurality of characters are combined to obtain a plurality of character strings.
  • the first building module 1104 may be specifically configured to:
  • the first building module 1104 may be specifically configured to:
  • the number of occurrences of the string is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix
  • the sum of the number of occurrences of the string and the preset initial value is calculated, and the calculated sum value is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix.
  • the second determining module 1105 may be specifically configured to:
  • the color depth of the image cell corresponding to each element in the transfer matrix is calculated according to the value of each element in the transfer matrix, and the target image data corresponding to the input file is obtained.
  • the second determining module 1105 may be specifically configured to:
  • the color depth of the image cell corresponding to the first element is determined according to the calculated ratio.
  • the first character string is a character string corresponding to the first element in the transfer matrix.
  • the second determining module 1105 may be specifically configured to:
  • the transition probability of the first element is determined according to the following formula:
  • the calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
  • the CNN model is based on the classical CNN Lenet-5 model
  • the first convolutional layer includes 32 convolution kernels
  • the second convolutional layer includes 64 convolution kernels
  • a DropOut layer with a rate of 0.25 is added after the second pooling layer
  • a DropOut layer with a rate of 0.5 is added after the first fully connected layer.
  • the sample file is a sample malicious file
  • the feature extraction device may further include: a second building module, configured to construct a malicious file feature library according to the multiple extracted features, where for each sample file the output result of the preset layer of the CNN model is extracted as the feature of the sample file.
  • the feature extraction device may further include: a second identification module, configured to:
  • the feature output by the preset layer of the CNN model in the pre-trained recognition model is extracted, so there is no need to manually analyze the file to extract features, which improves the efficiency of feature extraction and reduces labor cost.
  • a malicious file feature library is constructed from the extracted features of malicious files, and the file to be identified is identified based on the malicious file feature library. Since the features included in the malicious file feature library are directly extracted from malicious files, if the feature of the file to be identified matches a feature in the malicious file feature library, the file to be identified can be determined to be a malicious file, which improves the accuracy of file identification. In addition, compared with the amount of calculation required for the DNN model to identify a feature, matching a feature against the malicious file feature library requires far less calculation, which improves the efficiency of file identification.
  • the embodiment of the present application further provides a network device, as shown in FIG. 12, including a processor 1201 and a machine readable storage medium 1202 storing machine executable instructions that can be executed by the processor 1201.
  • the processor 1201 is caused by the machine executable instructions to implement the file identification method illustrated in FIG. 1 above. Specifically, the processor 1201 is caused by the machine executable instructions to implement:
  • the above-mentioned character string type refers to the category of a character string; the character string types obtained differ depending on the reading rule and/or phrase model used.
  • the processor 1201 is caused by machine executable instructions to specifically implement:
  • the file to be identified is read according to a preset reading rule to obtain a plurality of characters; according to the preset phrase model, adjacent characters of the plurality of characters are combined to obtain a plurality of character strings.
  • the processor 1201 is caused by machine executable instructions to specifically implement:
  • the processor 1201 is caused by machine executable instructions to specifically implement:
  • the number of occurrences of the string is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix
  • the sum of the number of occurrences of the string and the preset initial value is calculated, and the calculated sum value is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix.
  • the processor 1201 is caused by machine executable instructions to specifically implement:
  • the color depth of the image cell corresponding to each element in the transfer matrix is calculated according to the value of each element in the transfer matrix, and the target image data corresponding to the file to be identified is obtained.
  • the processor 1201 is caused by machine executable instructions to specifically implement:
  • the color depth of the image cell corresponding to the first element is determined according to the calculated ratio.
  • the first character string is a character string corresponding to the first element in the transfer matrix.
  • the processor 1201 is caused by machine executable instructions to specifically implement:
  • the transition probability of the first element is determined according to the following formula:
  • the calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
  • the processor 1201 is caused by machine executable instructions to specifically implement:
  • the target image data is input into the pre-trained CNN model to obtain the features of the target image data; wherein the CNN model is based on the classical CNN Lenet-5 model, the first convolutional layer includes 32 convolution kernels, the second convolutional layer includes 64 convolution kernels, a 0.25 DropOut layer is added after the second pooling layer, and a 0.5 DropOut layer is added after the first fully connected layer.
  • the processor 1201 is caused by machine executable instructions to specifically implement:
  • the feature of the target image data is input into the pre-trained DNN model to obtain an output result; wherein the DNN model is used to identify, from the feature of the image data, whether the file corresponding to the image data is a malicious file, and the output result indicates whether the file to be identified is a malicious file.
  • the feature of the target image data is an output result of a preset layer of the CNN model
  • the processor 1201 is caused by the machine executable instructions to specifically implement:
  • the preset malicious file feature library includes: a feature of the image data corresponding to the plurality of sample malicious files;
  • if the target feature is not found, it is determined that the file to be identified is a secure file.
  • the features of the image data corresponding to the plurality of sample malicious files may be obtained by inputting, for each sample malicious file, the image data corresponding to that sample malicious file into the CNN model, and taking the output result of the preset layer of the CNN model as the feature of the corresponding image data.
  • the network device may further include: a communication interface 1203 and a communication bus 1204; wherein the processor 1201, the machine readable storage medium 1202, and the communication interface 1203 communicate with each other via the communication bus 1204.
  • the communication interface 1203 is used for communication between the above network device and other devices.
  • when file identification is performed, the file to be identified is converted into image data, features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
  • the features of the image data are features that objectively exist in the file to be identified rather than being set according to experience; obtaining the file recognition result from such objectively existing features reduces the dependence of file recognition on subjective human factors and improves the accuracy of file recognition.
  • the embodiment of the present application further provides a network device, as shown in FIG. 13, including a processor 1301 and a machine readable storage medium 1302 storing machine executable instructions executable by the processor 1301.
  • The processor 1301 is caused by the machine executable instructions to implement the file identification method illustrated in FIG. 7 above. Specifically, the processor 1301 is caused by the machine executable instructions to implement:
  • the file identification model is configured to: determine a plurality of character strings corresponding to the input file according to the preset reading rule and the preset phrase model; construct a transfer matrix according to the plurality of character strings, wherein elements in the transfer matrix correspond one-to-one to character string types; determine target image data corresponding to the input file according to the elements in the transfer matrix; and extract features of the target image data and determine, according to the features of the target image data, whether the input file is a malicious file.
  • the character string type refers to the kind of character string; the kinds of character strings acquired differ depending on the reading rule and/or phrase model used.
  • the processor 1301 is caused by machine executable instructions to specifically implement:
  • the input file is read according to a preset reading rule to obtain a plurality of characters; according to the preset phrase model, adjacent characters of the plurality of characters are combined to obtain a plurality of character strings.
  • the processor 1301 is caused by machine executable instructions to specifically implement:
  • the processor 1301 is caused by machine executable instructions to specifically implement:
  • the number of occurrences of the string is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix
  • the sum of the number of occurrences of the string and the preset initial value is calculated, and the calculated sum value is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix.
  • the processor 1301 is caused by machine executable instructions to specifically implement:
  • the color depth of the image cell corresponding to each element in the transfer matrix is calculated according to the value of each element in the transfer matrix, and the target image data corresponding to the input file is obtained.
  • the processor 1301 is caused by machine executable instructions to specifically implement:
  • the color depth of the image cell corresponding to the first element is determined according to the calculated ratio.
  • the first character string is a character string corresponding to the first element in the transfer matrix.
  • the processor 1301 is caused by machine executable instructions to specifically implement:
  • the transition probability of the first element is determined according to the following formula:
  • the calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
  • the processor 1301 is caused by machine executable instructions to specifically implement:
  • the CNN model is based on the classic CNN Lenet-5 model;
  • the first convolutional layer includes 32 convolution kernels;
  • the second convolutional layer includes 64 convolution kernels;
  • a 0.25 DropOut layer is added after the second pooling layer;
  • a 0.5 DropOut layer is added after the first fully connected layer.
  • the processor 1301 is caused by machine executable instructions to specifically implement:
  • the features of the target image data are input into the pre-trained DNN model to obtain an output result; wherein the DNN model is used to identify the file by using the features of the image data to determine whether the file corresponding to the image data is a malicious file, and the output result indicates whether the input file is a malicious file.
  • the feature of the target image data is an output result of a preset layer of the CNN model
  • the processor 1301 is caused by the machine executable instructions to specifically implement:
  • the preset malicious file feature library includes: a feature of the image data corresponding to the plurality of sample malicious files;
  • the features of the image data corresponding to the plurality of sample malicious files may be obtained by inputting, for each sample malicious file, the image data corresponding to that sample malicious file into the CNN model, and taking the output result of the preset layer of the CNN model as the feature of the corresponding image data.
  • the network device may further include: a communication interface 1303 and a communication bus 1304; wherein the processor 1301, the machine readable storage medium 1302, and the communication interface 1303 communicate with each other via the communication bus 1304.
  • the communication interface 1303 is used for communication between the above network device and other devices.
  • when file identification is performed, the file to be identified is converted into image data, features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
  • the features of the image data are features that objectively exist in the file to be identified rather than being set according to experience; obtaining the file recognition result from such objectively existing features reduces the dependence of file recognition on subjective human factors and improves the accuracy of file recognition.
  • the embodiment of the present application further provides a network device, as shown in FIG. 14, including a processor 1401 and a machine readable storage medium 1402 storing machine executable instructions executable by the processor 1401.
  • The processor 1401 is caused by the machine executable instructions to implement the feature extraction method illustrated in FIG. 8 above. Specifically, the processor 1401 is caused by the machine executable instructions to implement:
  • the plurality of sample files are respectively input into the file recognition model; wherein the file recognition model is configured to: determine, according to the preset reading rule and the preset phrase model, a plurality of character strings corresponding to the input file; construct a transfer matrix according to the plurality of character strings corresponding to the input file, wherein elements in the transfer matrix correspond one-to-one to character string types; determine target image data corresponding to the input file according to the elements in the transfer matrix; and extract the features of the target image data corresponding to the input file by using the CNN model, and identify the features of the target image data by using the DNN model to determine whether the input file is a malicious file;
  • the output of the preset layer of the CNN model is extracted as a feature of the sample file.
  • the character string type refers to the kind of character string; the kinds of character strings acquired differ depending on the reading rule and/or phrase model used.
  • identifying the features of the target image data by using the DNN model to determine whether the input file is a malicious file means that the DNN model uses the features of the image data to identify the input file and determine whether the input file is a malicious file.
  • the processor 1401 is caused by machine executable instructions to specifically implement:
  • the input file is read according to a preset reading rule to obtain a plurality of characters; according to the preset phrase model, adjacent characters of the plurality of characters are combined to obtain a plurality of character strings.
  • the processor 1401 is caused by machine executable instructions to specifically implement:
  • the processor 1401 is caused by machine executable instructions to specifically implement:
  • the number of occurrences of the string is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix
  • the sum of the number of occurrences of the string and the preset initial value is calculated, and the calculated sum value is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix.
  • the processor 1401 is caused by machine executable instructions to specifically implement:
  • the color depth of the image cell corresponding to each element in the transfer matrix is calculated according to the value of each element in the transfer matrix, and the target image data corresponding to the input file is obtained.
  • the processor 1401 is caused by machine executable instructions to specifically implement:
  • the color depth of the image cell corresponding to the first element is determined according to the calculated ratio.
  • the first character string is a character string corresponding to the first element in the transfer matrix.
  • the processor 1401 is caused by machine executable instructions to specifically implement:
  • the transition probability of the first element is determined according to the following formula:
  • the calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
  • the CNN model is based on the classical CNN Lenet-5 model
  • the first convolutional layer includes 32 convolution kernels
  • the second convolutional layer includes 64 convolution kernels
  • a 0.25 DropOut layer is added after the second pooling layer
  • a 0.5 DropOut layer is added after the first fully connected layer.
  • the sample file is a sample malicious file
  • the processor 1401 is caused by the machine executable instructions to further implement: extracting, for each sample file, an output result of the preset layer of the CNN model as a feature of the sample file, and constructing a malicious file feature library according to the acquired multiple features.
  • the machine executable instructions may further include: a second identification instruction
  • the processor 1401 is caused by the machine executable instructions to: input the file to be identified into the file recognition model; obtain the output result of the preset layer of the CNN model in the file recognition model as the target feature; and search the malicious file feature library for the target feature; if the target feature is found, determine that the file to be identified is a malicious file; if not found, determine that the file to be identified is a secure file.
  • the network device may further include: a communication interface 1403 and a communication bus 1404; wherein the processor 1401, the machine readable storage medium 1402, and the communication interface 1403 communicate with each other via the communication bus 1404.
  • the communication interface 1403 is used for communication between the above network device and other devices.
  • the features output by the preset layer of the CNN model in the pre-trained file recognition model are extracted as the features of the file, so that no manual analysis is required to extract file features, which improves the efficiency of feature extraction and reduces labor costs.
  • a malicious file feature library is constructed based on the extracted features of malicious files, and the file to be identified is identified based on the malicious file feature library. Since the features included in the malicious file feature library are directly extracted from malicious files, if the feature of the file to be identified matches a feature in the malicious file feature library, the file to be identified can be determined to be a malicious file, which improves the accuracy of file identification. In addition, compared with the computation required for the DNN model to identify features, matching features against the malicious file feature library requires much less computation, which improves the efficiency of file recognition.
  • the communication bus may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus.
  • the communication bus can be divided into an address bus, a data bus, a control bus, and the like.
  • the machine readable storage medium may include a Random Access Memory (RAM), and may also include a Non-Volatile Memory (NVM), such as at least one disk storage. Additionally, the machine readable storage medium may also be at least one storage device located remotely from the aforementioned processor.
  • the processor may be a general-purpose processor, including a Central Processing Unit (CPU) or a Network Processor (NP); it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
  • the embodiment of the present application further provides a machine readable storage medium storing machine executable instructions.
  • When being called and executed by a processor, the machine executable instructions cause the processor to implement the file identification method shown in FIG. 1 above. Specifically, the machine executable instructions cause the processor to implement:
  • the character string type refers to the kind of character string; the kinds of character strings acquired differ depending on the reading rule and/or phrase model used.
  • when file identification is performed, the file to be identified is converted into image data, features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
  • the features of the image data are features that objectively exist in the file to be identified rather than being set according to experience; obtaining the file recognition result from such objectively existing features reduces the dependence of file recognition on subjective human factors and improves the accuracy of file recognition.
  • the embodiment of the present application further provides a machine readable storage medium storing machine executable instructions.
  • When being called and executed by a processor, the machine executable instructions cause the processor to implement the file identification method shown in FIG. 7 above. Specifically, the machine executable instructions cause the processor to implement:
  • the file identification model is configured to: determine a plurality of character strings corresponding to the input file according to the preset reading rule and the preset phrase model; construct a transfer matrix according to the plurality of character strings; and use the element and the string type in the transfer matrix Corresponding to; determining the target image data corresponding to the input file according to the elements in the transfer matrix; extracting features of the target image data, and determining whether the input file is a malicious file according to the characteristics of the target image data.
  • the character string type refers to the kind of character string; the kinds of character strings acquired differ depending on the reading rule and/or phrase model used.
  • when file identification is performed, the file to be identified is converted into image data, features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
  • the features of the image data are features that objectively exist in the file to be identified rather than being set according to experience; obtaining the file recognition result from such objectively existing features reduces the dependence of file recognition on subjective human factors and improves the accuracy of file recognition.
  • the embodiment of the present application further provides a machine readable storage medium storing machine executable instructions.
  • When being called and executed by a processor, the machine executable instructions cause the processor to implement the feature extraction method shown in FIG. 8 above. Specifically, the machine executable instructions cause the processor to implement:
  • the plurality of sample files are respectively input into the file recognition model; wherein the file recognition model is configured to: determine, according to the preset reading rule and the preset phrase model, a plurality of character strings corresponding to the input file; construct a transfer matrix according to the plurality of character strings corresponding to the input file, wherein elements in the transfer matrix correspond one-to-one to character string types; determine target image data corresponding to the input file according to the elements in the transfer matrix; and extract the features of the target image data corresponding to the input file by using the CNN model, and identify the features of the target image data by using the DNN model to determine whether the input file is a malicious file;
  • the output of the preset layer of the CNN model is extracted as a feature of the sample file.
  • the character string type refers to the kind of character string; the kinds of character strings acquired differ depending on the reading rule and/or phrase model used.
  • the features output by the preset layer of the CNN model in the pre-trained file recognition model are extracted as the features of the file, so that no manual analysis is required to extract file features, which improves the efficiency of feature extraction and reduces labor costs.
  • a malicious file feature library is constructed based on the extracted features of malicious files, and the file to be identified is identified based on the malicious file feature library. Since the features included in the malicious file feature library are directly extracted from malicious files, if the feature of the file to be identified matches a feature in the malicious file feature library, the file to be identified can be determined to be a malicious file, which improves the accuracy of file identification. In addition, compared with the computation required for the DNN model to identify features, matching features against the malicious file feature library requires much less computation, which improves the efficiency of file recognition.
  • the embodiment of the present application further provides a machine executable instruction that, when called and executed by a processor, causes the processor to implement the file identification method shown in FIG. 1 above.
  • machine executable instructions cause the processor to implement:
  • the character string type refers to the kind of character string; the kinds of character strings acquired differ depending on the reading rule and/or phrase model used.
  • when file identification is performed, the file to be identified is converted into image data, features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
  • the features of the image data are features that objectively exist in the file to be identified rather than being set according to experience; obtaining the file recognition result from such objectively existing features reduces the dependence of file recognition on subjective human factors and improves the accuracy of file recognition.
  • the embodiment of the present application further provides a machine executable instruction that, when called and executed by a processor, causes the processor to implement the file identification method shown in FIG. 7 above.
  • the machine executable instructions cause the processor to: acquire the file to be identified;
  • the file identification model is configured to: determine a plurality of character strings corresponding to the input file according to the preset reading rule and the preset phrase model; construct a transfer matrix according to the plurality of character strings, wherein elements in the transfer matrix correspond one-to-one to character string types; determine target image data corresponding to the input file according to the elements in the transfer matrix; and extract features of the target image data and determine, according to the features of the target image data, whether the input file is a malicious file.
  • the character string type refers to the kind of character string; the kinds of character strings acquired differ depending on the reading rule and/or phrase model used.
  • when file identification is performed, the file to be identified is converted into image data, features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
  • the features of the image data are features that objectively exist in the file to be identified rather than being set according to experience; obtaining the file recognition result from such objectively existing features reduces the dependence of file recognition on subjective human factors and improves the accuracy of file recognition.
  • the embodiment of the present application further provides a machine executable instruction that, when invoked and executed by a processor, causes the processor to implement the feature extraction method shown in FIG. 8 above.
  • machine executable instructions cause the processor to implement:
  • the plurality of sample files are respectively input into the file recognition model; wherein the file recognition model is configured to: determine, according to the preset reading rule and the preset phrase model, a plurality of character strings corresponding to the input file; construct a transfer matrix according to the plurality of character strings corresponding to the input file, wherein elements in the transfer matrix correspond one-to-one to character string types; determine target image data corresponding to the input file according to the elements in the transfer matrix; and extract the features of the target image data corresponding to the input file by using the CNN model, and identify the features of the target image data by using the DNN model to determine whether the input file is a malicious file;
  • the output of the preset layer of the CNN model is extracted as a feature of the sample file.
  • the character string type refers to the kind of character string; the kinds of character strings acquired differ depending on the reading rule and/or phrase model used.
  • the features output by the preset layer of the CNN model in the pre-trained file recognition model are extracted as the features of the file, so that no manual analysis is required to extract file features, which improves the efficiency of feature extraction and reduces labor costs.
  • a malicious file feature library is constructed based on the extracted features of malicious files, and the file to be identified is identified based on the malicious file feature library. Since the features included in the malicious file feature library are directly extracted from malicious files, if the feature of the file to be identified matches a feature in the malicious file feature library, the file to be identified can be determined to be a malicious file, which improves the accuracy of file identification. In addition, compared with the computation required for the DNN model to identify features, matching features against the malicious file feature library requires much less computation, which improves the efficiency of file recognition.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Virology (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A file identification method and a feature extraction method. The file identification method includes: acquiring a file to be identified (101); determining, according to a preset reading rule and a preset phrase model, a plurality of character strings corresponding to the file to be identified (102); constructing a transfer matrix according to the plurality of character strings (103), wherein elements in the transfer matrix correspond one-to-one to character string types; determining, according to the elements in the transfer matrix, target image data corresponding to the file to be identified (104); and extracting features of the target image data, and determining, according to the features of the target image data, whether the file to be identified is a malicious file (105).

Description

[Title of the invention established by the ISA under Rule 37.2] File identification method and feature extraction method
This application claims priority to Chinese Patent Application No. 201810349458.6, entitled "File Identification Method and Feature Extraction Method", filed with the China Patent Office on April 18, 2018, the entire contents of which are incorporated herein by reference.
Background
Malicious code is one form of attack used by attackers. A file carrying malicious code is a malicious file; that is, a malicious file is one form in which an attacker carries out an attack. Malicious files exploit vulnerabilities in network services to attack network servers for purposes such as stealing information and paralyzing services.
To improve network security and guarantee service quality, accurate identification of malicious files is required. At present, the file identification process includes: acquiring a file to be identified, running the file to be identified in a sandbox, extracting runtime features of the file to be identified, normalizing the extracted runtime features, and inputting the normalized runtime features into a Deep Neural Network (DNN) model to obtain the probability that the file to be identified is a non-malicious file and the probability that it is a malicious file, thereby determining whether the file to be identified is a malicious file. For example, if the probability that the file to be identified is a non-malicious file is greater than the probability that it is a malicious file, the file to be identified is determined to be a non-malicious file; otherwise, it is determined to be a malicious file. The DNN model is trained using runtime features of files.
Brief Description of the Drawings
FIG. 1 is a first schematic flowchart of a file identification method provided by an embodiment of the present application;
FIG. 2 is a first schematic diagram of a transfer matrix provided by an embodiment of the present application;
FIG. 3 is a second schematic diagram of a transfer matrix provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of image data based on the transfer matrix shown in FIG. 3;
FIG. 5 is a schematic structural diagram of a convolutional neural network model provided by an embodiment of the present application;
FIG. 6 is a schematic flowchart of a model training method provided by an embodiment of the present application;
FIG. 7 is a second schematic flowchart of a file identification method provided by an embodiment of the present application;
FIG. 8 is a schematic flowchart of a feature extraction method provided by an embodiment of the present application;
FIG. 9 is a first schematic structural diagram of a file identification apparatus provided by an embodiment of the present application;
FIG. 10 is a second schematic structural diagram of a file identification apparatus provided by an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a feature extraction apparatus provided by an embodiment of the present application;
FIG. 12 is a first schematic structural diagram of a network device provided by an embodiment of the present application;
FIG. 13 is a second schematic structural diagram of a network device provided by an embodiment of the present application;
FIG. 14 is a third schematic structural diagram of a network device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.
At present, in the file identification process, the runtime features extracted when the file to be identified is run in a sandbox are set by users according to experience; that is, file identification depends on subjective human factors, and the accuracy of file identification is low.
To improve the accuracy of file identification, an embodiment of the present application provides a file identification method. The file identification method may be applied to a network device, such as a firewall device, a router, or a switch. The method may also be performed by a file identification apparatus, which may be implemented by hardware and/or software and may generally be integrated in a network device used for file identification.
In the file identification method provided by the embodiment of the present application, the file to be identified is converted into image data, features of the image data are extracted, and then whether the file to be identified is a malicious file is determined according to the extracted features. The features of the image data are features that objectively exist in the file to be identified rather than being set according to experience; obtaining the file recognition result from such objectively existing features reduces the dependence of file recognition on subjective human factors and improves the accuracy of file recognition. Therefore, the file identification method provided by the embodiment of the present application is more accurate.
The present application is described in detail below through specific embodiments.
Referring to FIG. 1, FIG. 1 is a first schematic flowchart of the file identification method provided by an embodiment of the present application. The method includes the following process.
In part 101: acquire the file to be identified.
Take the case where the file identification method is executed by a network device as an example. The file to be identified acquired by the network device may be a file sent to the network device by another network device; it may also be a file obtained from locally stored files.
In part 102: determine, according to a preset reading rule and a preset phrase model, a plurality of character strings corresponding to the file to be identified.
In an embodiment of the present application, determining the plurality of character strings corresponding to the file to be identified according to the preset reading rule and the preset phrase model may include: reading the file to be identified according to the preset reading rule to obtain a plurality of characters, and combining adjacent characters among the plurality of characters according to the preset phrase model to obtain a plurality of character strings.
The reading rule may include binary, octal, or hexadecimal, but is not limited to these reading rules. The preset phrase model may include a bigram (BiGram) model and/or a trigram (TriGram) model.
In part 103: construct a transfer matrix according to the plurality of character strings corresponding to the file to be identified, wherein elements in the transfer matrix correspond one-to-one to character string types. The character string type is the kind of character string; different reading rules and/or phrase models yield different kinds of character strings.
In an embodiment of the present application, constructing the transfer matrix according to the plurality of character strings corresponding to the file to be identified may include: determining the number of occurrences of each character string among the plurality of character strings, and constructing the transfer matrix according to the number of occurrences of each character string. Optionally, the transfer matrix has the same number of rows and columns, both equal to the ratio of the number of character string types to the number of character types. The number of character string types is the number of kinds of character strings obtained when determining character strings according to the preset reading rule and the preset phrase model; the number of character types is the number of kinds of characters obtained when reading the file according to the preset reading rule.
For example, the preset reading rule is hexadecimal, and the preset phrase models include the BiGram model and the TriGram model.
When a file is read in hexadecimal, 16 kinds of characters (0-F) can be obtained.
According to the BiGram model, combining any two of the 16 kinds of characters yields 16*16=256 kinds of character strings.
According to the TriGram model, combining any three of the 16 kinds of characters yields 16*16*16=4096 kinds of character strings.
According to the rules that the transfer matrix has the same number of rows and columns and that the elements in the transfer matrix correspond one-to-one to character string types, the number of rows and the number of columns of the transfer matrix may each be (256+4096)/16=272. That is, a 272*272 transfer matrix can be constructed according to the number of occurrences of each character string corresponding to the file to be identified.
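The dimensioning arithmetic above can be checked with a short sketch (illustrative only; it assumes hexadecimal reading with both the BiGram and TriGram models, as in this example):

```python
# Number of distinct characters when a file is read in hexadecimal.
num_chars = 16

# BiGram strings combine any two characters; TriGram strings combine any three.
bigram_types = num_chars ** 2    # 256 kinds of two-character strings
trigram_types = num_chars ** 3   # 4096 kinds of three-character strings

# The square transfer matrix has (string types) / (character types)
# rows and columns.
matrix_size = (bigram_types + trigram_types) // num_chars
print(matrix_size)  # 272
```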
In an embodiment of the present application, constructing the transfer matrix according to the number of occurrences of each character string may include: for each character string, taking the number of occurrences of the character string as the value of the element corresponding to the character string in the transfer matrix to obtain the transfer matrix.
The following description takes the BiGram model as the preset phrase model. For example, the network device acquires a file f1 to be identified, reads the file f1 according to the preset reading rule, and obtains a plurality of characters: abcbbcdabcd.
According to the BiGram model, adjacent characters among the plurality of characters corresponding to the file f1 are combined, and the resulting character strings are: ab, bc, cb, bb, bc, cd, da, ab, bc, cd. The number of occurrences of each character string is: "ab" occurs 2 times, "bc" occurs 3 times, "cb" occurs 1 time, "bb" occurs 1 time, "cd" occurs 2 times, and "da" occurs 1 time. All other character strings occur 0 times.
According to the numbers of occurrences determined above, the value of the element corresponding to each character string in the transfer matrix is determined, and transfer matrix 1 is obtained, as shown in FIG. 2. Each square in FIG. 2 represents one element of the matrix; the horizontal character corresponding to the square and the vertical character corresponding to the square form a character string, which is the character string corresponding to that square.
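The counting step in this example can be sketched in Python (a sketch of the bookkeeping only, not the patented implementation; the character sequence is the example file f1 above):

```python
from collections import Counter

# Characters obtained by reading the example file f1.
chars = "abcbbcdabcd"

# BiGram model: combine each pair of adjacent characters into a string.
bigrams = [chars[i:i + 2] for i in range(len(chars) - 1)]

# The occurrence count of each string is the value of the corresponding
# element in transfer matrix 1; strings that never occur stay at 0.
counts = Counter(bigrams)

alphabet = sorted(set(chars))  # one row/column per character kind
matrix = {r + c: counts.get(r + c, 0) for r in alphabet for c in alphabet}
```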
In another embodiment of the present application, in order to improve smoothness and prevent overfitting, constructing the transfer matrix according to the number of occurrences of each character string may include: for each character string, calculating the sum of the number of occurrences of the character string and a preset initial value, and taking the calculated sum as the value of the element corresponding to the character string in the transfer matrix to obtain the transfer matrix.
Continuing with the above example, if the preset initial value is 10, the numbers of occurrences of the character strings corresponding to the file f1 are:
"ab": 2+10=12,
"bc": 3+10=13,
"cb": 1+10=11,
"bb": 1+10=11,
"cd": 2+10=12,
"da": 1+10=11,
and all other character strings: 0+10=10.
According to the numbers of occurrences determined above, the value of the element corresponding to each character string in the transfer matrix is determined, and transfer matrix 2 is obtained, as shown in FIG. 3. Each square in FIG. 3 represents one element of the matrix; the horizontal character corresponding to the square and the vertical character corresponding to the square form a character string, which is the character string corresponding to that square.
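The smoothed variant can be sketched the same way (a sketch; the preset initial value 10 and the five characters a-e shown in FIG. 3 are taken from the example above):

```python
from collections import Counter

chars = "abcbbcdabcd"
initial = 10  # preset initial value added to every element for smoothing

counts = Counter(chars[i:i + 2] for i in range(len(chars) - 1))

# Transfer matrix 2: occurrence count plus the initial value, for every
# string over the characters a-e (as in FIG. 3; 'e' never occurs in f1).
alphabet = "abcde"
matrix2 = {r + c: counts.get(r + c, 0) + initial
           for r in alphabet for c in alphabet}
```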
In part 104: determine, according to the elements in the transfer matrix, target image data corresponding to the file to be identified.
After the transfer matrix is determined, the target image data corresponding to the file to be identified is determined according to the elements in the transfer matrix.
In an embodiment of the present application, one element in the transfer matrix corresponds to one image cell, and determining the target image data corresponding to the file to be identified means converting the value of each element in the transfer matrix into image data. Specifically, the color depth of the image cell corresponding to each element in the transfer matrix is calculated according to the value of each element, and the target image data corresponding to the file to be identified is obtained. At this point, the conversion process "file to be identified → characters/character strings → transfer matrix → image data" is complete.
The image cell is the smallest unit of image processing. Color depth refers to the gray value of a point in a black-and-white image. In the embodiments of the present application, the color depth is used as the value of the image cell. The color depth generally ranges from 0 to 255, with white being 255 and black being 0. The embodiments of the present application do not limit the range of the color depth; that is, the color depth may be an integer or a decimal, and may be positive or negative.
Optionally, for any element in the transfer matrix, the color depth of the image cell corresponding to each element may be determined in the following manner.
Specifically, for a first element in the transfer matrix, the value of the first element is determined as a first value, where the first element is any element in the transfer matrix, and the value of the first element is determined according to the number of occurrences of a first character string. The first character string is the character string corresponding to the first element in the transfer matrix.
The sum of the values of all second elements is determined as a second value, where the value of a second element is determined according to the number of occurrences of a second character string, and the head word of the second character string is the same as the head word of the first character string. Here, the second character strings include the first character string. The head word is the first character.
The ratio of the first value to the second value is calculated.
Then, the color depth of the image cell corresponding to the first element in the transfer matrix is determined according to the calculated ratio.
In one implementation, for each element (for example, the first element) in the transfer matrix, the calculated ratio (that is, the ratio of the first value to the second value) may be used as the color depth of the image cell corresponding to the first element in the transfer matrix.
In another implementation, for each element (for example, the first element) in the transfer matrix, the transition probability of the first element may be determined according to the following formula:
h = Log T.         (1)
where h is the transition probability of the first element and T is the calculated ratio, that is, the ratio of the first value to the second value. For example, if the number of occurrences of the character string "xy" is T_xy and the sum of the numbers of occurrences of the character strings whose head word is x is T_x, then T = T_xy/T_x.
The calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
Taking the transfer matrix shown in FIG. 3 as an example, the following can be determined for transfer matrix 2 according to formula (1):
The transition probability of the element corresponding to "ab" is: h_ab = Log[T_ab/T_a] = Log[12/(10+12+10+10+10)] = -0.637.
The transition probabilities of the elements corresponding to "aa", "ac", "ad" and "ae" are: h_a = Log[T_ax/T_a] = Log[10/(10+12+10+10+10)] = -0.716.
The transition probability of the element corresponding to "bb" is: h_bb = Log[T_bb/T_b] = Log[11/(10+11+13+10+10)] = -0.691.
The transition probability of the element corresponding to "bc" is: h_bc = Log[T_bc/T_b] = Log[13/(10+11+13+10+10)] = -0.618.
The transition probabilities of the elements corresponding to "ba", "bd" and "be" are: h_b = Log[T_bx/T_b] = Log[10/(10+11+13+10+10)] = -0.732.
The transition probability of the element corresponding to "cb" is: h_cb = Log[T_cb/T_c] = Log[11/(10+11+10+12+10)] = -0.683.
The transition probability of the element corresponding to "cd" is: h_cd = Log[T_cd/T_c] = Log[12/(10+11+10+12+10)] = -0.645.
The transition probabilities of the elements corresponding to "ca", "cc" and "ce" are: h_c = Log[T_cx/T_c] = Log[10/(10+11+10+12+10)] = -0.724.
The transition probability of the element corresponding to "da" is: h_da = Log[T_da/T_d] = Log[11/(11+10+10+10+10)] = -0.666.
The transition probabilities of the elements corresponding to "db", "dc", "dd" and "de" are: h_d = Log[T_dx/T_d] = Log[10/(11+10+10+10+10)] = -0.708.
The transition probabilities of the elements corresponding to "ea", "eb", "ec", "ed" and "ee" are: h_e = Log[T_ex/T_e] = Log[10/(10+10+10+10+10)] = -0.699.
确定每一元素的转移概率,也就是,确定了每一元素对应的图像单元格的颜色深度。获得了各图像单元格的颜色深度,也就确定了图像数据,如图4所示。
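上述"转移矩阵→转移概率→颜色深度"的计算可用如下Python代码示意（平滑初始值10、字符表"abcde"均为本例假设；Log此处取常用对数，即以10为底）：

```python
import math
from collections import Counter

def color_depths(chars, alphabet, init=10):
    """按 h = Log(T_xy / T_x) 计算各图像单元格的颜色深度。
    T_xy为元素对应的（平滑后的）出现次数，T_x为头部词相同的所有元素之和。"""
    counts = Counter(chars[i] + chars[i + 1] for i in range(len(chars) - 1))
    matrix = {r + c: counts[r + c] + init for r in alphabet for c in alphabet}
    depths = {}
    for r in alphabet:
        row_sum = sum(matrix[r + c] for c in alphabet)  # 第二数值
        for c in alphabet:
            depths[r + c] = math.log10(matrix[r + c] / row_sum)
    return depths

d = color_depths("abcbbcdabcd", "abcde")
print(round(d["bc"], 3))  # -0.618
print(round(d["cd"], 3))  # -0.645
```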
在105部分:提取目标图像数据的特征,并根据目标图像数据的特征,确定待识别文件是否为恶意文件。
在本申请的一个实施例中，可以采用卷积神经网络（英文：Convolutional Neural Networks，简称：CNN）模型提取目标图像数据的特征。可选地，为了获得更为适用于文件识别的CNN模型，本申请实施例采用的CNN模型可以以经典CNN Lenet-5模型为基础，在经典CNN Lenet-5结构的基础上进行改进得到。其中，Lenet-5为一种经典的CNN网络架构，包括3个卷积层、2个池化层和2个全连接层。一种实现方式中，对Lenet-5结构的改进，如图5所示。
01、第一个卷积层包括32个卷积核,第二个卷积层包括64个卷积核。
02、第二个池化层后面增加0.25的丢弃(英文:DropOut)层,第一个全连接层后面增加0.5的DropOut层。其中,DropOut层又可以称为Discard层。
在本申请的一个实施例中，可以采用DNN模型对目标图像数据的特征进行识别，也就是，采用DNN模型，利用目标图像数据的特征对待识别文件进行识别，确定待识别文件是否为恶意文件。具体的，将目标图像数据的特征输入预先训练的DNN模型，得到输出结果，其中，输出结果指示待识别文件是否为恶意文件。具体的，输出结果指示待识别文件为恶意文件，或者输出结果指示待识别文件为非恶意文件。非恶意文件即为安全文件。
例如,将目标图像数据的特征输入DNN模型,得到待识别文件为安全文件的第一概率,以及待识别文件为恶意文件的第二概率。若第一概率大于第二概率,则DNN模型的输出结果指示待识别文件为安全文件。否则,DNN模型的输出结果指示待识别文件为恶意文件。
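概率比较后的判定逻辑可示意如下（概率值为假设的示例输入）：

```python
def classify(p_safe, p_malicious):
    """第一概率（安全文件）大于第二概率（恶意文件）时，
    判定为安全文件；否则判定为恶意文件。"""
    return "安全文件" if p_safe > p_malicious else "恶意文件"

print(classify(0.8, 0.2))  # 安全文件
print(classify(0.3, 0.7))  # 恶意文件
```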
本申请实施例中,利用图像数据的特征确定待识别文件是否为恶意文件。图像数据的特征是待识别文件客观存在的特征,而不是根据经验设定的,依据此客观存在的特征的识别结果,降低了文件识别对人的主观因素的依赖,提高了文件识别的准确性。
本申请实施例中,为了提高文件识别的准确性,在对待识别文件进行识别前,可预先训练DNN模型和CNN模型。具体的可参考图6所示的模型训练方法的一种流程示意图。该方法包括如下过程。
在601部分：针对预设DNN模型，初始化该DNN模型的参数集中的参数，初始化的参数集可以由θ_i表示。为了加快DNN模型的训练，初始化的参数可以根据实际需要和经验进行设置。i为当前已进行前向计算的次数计数/累计次数。
在602部分：针对预设CNN模型，初始化该CNN模型的参数集中的参数，初始化的参数集可以由θ′_i表示。为了加快CNN模型的训练，初始化的参数可以根据实际需要和经验进行设置。i为当前已进行前向计算的次数计数/累计次数。
在601或602部分中,还可以对训练相关的高层参数,如学习率、梯度下降算法、反向传播算法等,进行设置。具体可以采用相关技术中的各种方式设置训练相关的高层参数,在此不再进行详细描述。
在603部分:获取预设训练集。预设训练集包括样本文件、以及样本文件的标签,标签可以包括:用于指示文件为恶意文件的第一标签和用于指示文件为非恶意文件的第二标签。样本文件可以为二进制文件。
预设训练集包括的样本文件可以通过网络爬虫等从网络中获取到,也可以从预先获取的样本文件库中获取,本申请实施例对此不进行限定。
为了使训练获得的CNN模型和DNN模型更加准确可靠，预设训练集中包括的样本文件越多越好。
本申请实施例中不限定601、602和603部分的执行顺序。
在604部分:将预设训练集中每一样本文件转换为图像数据。
将样本文件转换为图像数据的步骤,可以参考上述将待识别文件转换为目标图像数据的过程,此处不再赘述。
在605部分:进行前向计算,具体如下。
将604部分中获得的每一样本文件的图像数据输入预设CNN模型,得到该样本文件对应的图像数据的特征。将预设CNN模型输出的特征输入预设DNN模型,得到该样本文件对应的输出结果。输出结果指示该样本文件为安全文件,或指示该样本文件为恶意文件。
例如，将一样本文件对应的图像数据的特征输入预设DNN模型进行处理过程中，得到样本文件为安全文件的第三概率，以及样本文件为恶意文件的第四概率。若第三概率大于第四概率，则确定该样本文件对应的输出结果为：该样本文件为安全文件；否则，确定该样本文件对应的输出结果为：该样本文件为恶意文件。
第一次进入本605部分处理时，当前参数集为θ_1和θ′_1。
后续再次进入本605部分处理时，当前参数集θ_i为对上一次使用的参数集θ_{i-1}进行调整后得到的，当前参数集θ′_i为对上一次使用的参数集θ′_{i-1}进行调整后得到的，详见后续描述。
在606部分:基于各样本文件的标签和预设DNN模型对应的输出结果,计算损失值。
一个例子中，可以使用均方误差（英文：Mean Squared Error，简称：MSE）公式作为损失函数，得到损失值L(θ_i)，详见如下公式：
L(θ_i)=(1/H)·Σ_{j=1}^{H}[F(I_j;θ_i)-X_j]^2
其中，H表示单次训练中从预设训练集中选取的样本文件个数，I_j表示第j个样本文件对应的图像数据的特征，F(I_j;θ_i)表示针对第j个样本文件、DNN模型在参数集θ_i下前向计算得到的输出结果，X_j表示第j个样本文件的标签，i为当前已进行前向计算的次数计数/累计次数。
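上述均方误差损失可用如下Python代码示意（输出值与标签均为假设的示例数据）：

```python
def mse_loss(outputs, labels):
    """均方误差：L = (1/H) * Σ (F(I_j; θ_i) - X_j)^2，
    H为单次训练中选取的样本文件个数。"""
    assert len(outputs) == len(labels)
    h = len(outputs)
    return sum((f - x) ** 2 for f, x in zip(outputs, labels)) / h

print(round(mse_loss([0.9, 0.2, 0.8], [1, 0, 1]), 6))  # 0.03
```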
在607部分:基于损失值,确定采用当前参数集的预设模型是否收敛。其中,预设模型包括CNN模型和预设DNN模型。
如果预设模型不收敛,进入608部分;如果预设模型收敛,进入609部分。
例如,可以当损失值小于预设损失值阈值时,确定收敛;也可以当本次计算得到损失值与上一次计算得到的损失值之差小于预设变化阈值时,确定收敛,本申请实施例在此不做限定。
在608部分：对当前参数集θ_i和θ′_i中的参数进行调整，得到调整后的参数集，然后进入605部分，用于下一次前向计算。
具体可以利用反向传播算法对当前参数集中的参数进行调整。
在609部分：将当前参数集θ_i作为输出的最终参数集θ_final，将当前参数集θ′_i作为输出的最终参数集θ′_final。将采用最终参数集θ_final的该预设DNN模型，作为训练完成的DNN模型。将采用最终参数集θ′_final的该预设CNN模型，作为训练完成的CNN模型。
上述CNN模型和DNN模型的训练可以与文件识别在同一网络设备上实现。为了降低模型训练对进行文件识别的网络设备的影响，训练CNN模型和DNN模型的网络设备也可以与进行文件识别的网络设备不同。
在本申请的一个实施例中,可以采用恶意文件特征库对目标图像数据的特征进行识别,确定待识别文件是否为恶意文件。其中,恶意文件特征库包括:多个样本恶意文件对应的图像数据的特征。具体的,将目标图像数据输入CNN模型,获取CNN模型的预设层的输出结果作为目标图像数据的特征。从预设的恶意文件特征库中查找目标图像数据的特征。若查找到,则确定待识别文件为恶意文件。若未查找到,则确定待识别文件为安全文件。
为了进一步提高文件识别的准确性,提高文件识别的效率,一个可选的实施例中,在预先训练获得了CNN模型后,可以将样本恶意文件对应的图像数据输入CNN模型,获取CNN模型的预设层的输出结果,将CNN模型的预设层的输出结果作为样本恶意文件对应的图像数据的特征。由多个样本恶意文件对应的图像数据的特征,构建恶意文件特征库。
可选的,为了避免图像数据的特征过长,增加文件识别的计算量,同时,为了避免图像数据的特征过短,降低文件识别的准确性,预设层可以为CNN模型的第三个卷积层,如图4所示。可选的,第三个卷积层输出的特征长度为512字节。
由于恶意文件特征库中的特征是从恶意文件中直接提取到的,若待识别文件的特征与恶意文件特征库中的特征匹配,可以确定待识别文件为恶意文件,提高了文件识别的准确性。另外,相较于DNN模型识别特征的计算量,匹配恶意文件特征库中的特征计算量要小很多,提高了文件识别的效率。
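特征库的构建与查找可用如下Python代码示意（此处以集合做精确匹配，特征以整数序列模拟，均为示意性假设，并非实际的512字节特征）：

```python
def build_feature_library(sample_features):
    """由多个样本恶意文件的特征构建恶意文件特征库。"""
    return {bytes(f) for f in sample_features}

def identify(feature, library):
    """特征在库中查找到则为恶意文件，否则为安全文件。"""
    return "恶意文件" if bytes(feature) in library else "安全文件"

lib = build_feature_library([[1, 2, 3], [9, 9, 9]])
print(identify([1, 2, 3], lib))  # 恶意文件
print(identify([4, 5, 6], lib))  # 安全文件
```

集合查找的平均复杂度为O(1)，这正是"匹配特征库的计算量远小于DNN模型识别"这一结论的一种直观体现。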
基于相同的发明构思,本申请实施例还提供了一种文件识别方法。参考图7,图7为本申请实施例提供的文件识别方法的第二种流程示意图,包括如下过程。
在701部分:获取待识别文件。
以文件识别方法的执行主体为网络设备为例。网络设备获取到的待识别文件可以是:其他网络设备发送给该网络设备的文件。网络设备获取到的待识别文件也可以是:从本地存储的文件中获取的文件。
在702部分:将待识别文件输入预先训练的文件识别模型,确定待识别文件是否为恶意文件。
其中,文件识别模型用于:根据预设读取规则和预设词组模型,确定输入文件对应的多个字符串;根据多个字符串,构建转移矩阵,转移矩阵中的元素与字符串种类一一对应;根据转移矩阵中的元素,确定输入文件对应的目标图像数据;提取目标图像数据的特征,并根据目标图像数据的特征,确定输入文件是否为恶意文件。
这里,输入文件为输入文件识别模型的文件。将待识别文件输入文件识别模型时,输入文件即为待识别文件。字符串种类为字符串的种类,根据不同的读取规则和/或词组模型,获取的字符串的种类也不相同。
在本申请的一个实施例中,根据预设读取规则和预设词组模型,确定输入文件对应的多个字符串,可以包括:按照预设读取规则读取输入文件,得到多个字符,按照预设词组模型,组合多个字符中相邻的字符,得到多个字符串。
其中,读取规则可以包括:2进制、8进制或16进制,但不限于这几种读取规则。预设词组模型可以包括BiGram模型和/或TriGram模型。
在本申请的一个实施例中,根据输入文件对应的多个字符串,构建转移矩阵,可以包括:确定每一字符串在多个字符串中的出现次数,根据每一字符串的出现次数,构建转移矩阵。可选的,转移矩阵的行数和列数相同,转移矩阵的行数和列数均为:字符串种类数与字符种类数的比值。其中,字符串种类数为:根据预设读取规则和预设词组模型确定字符串时,可获取到的字符串的种类数;字符种类数为:根据预设读取规则读取文件时,可获取到的字符的种类数。
例如,预设读取规则为16进制,预设词组模型可以包括BiGram模型和TriGram模型。按照转移矩阵的行数和列数相同,以及转移矩阵中的元素与字符串种类一一对应的规则,转移矩阵的行数和列数可以为272。也就是,可以根据输入文件对应的每一字符串的出现次数,构建272*272的转移矩阵。
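上例中272的来源可以这样核算（16进制读取、BiGram与TriGram组合为本例前提）：

```python
char_kinds = 16                   # 16进制读取时可获取到的字符种类数
string_kinds = 16 ** 2 + 16 ** 3  # BiGram与TriGram的字符串种类数：256 + 4096
dim = string_kinds // char_kinds  # 字符串种类数与字符种类数的比值
print(dim)  # 272
```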
在本申请的一个实施例中,根据每一字符串的出现次数,构建转移矩阵,可以包括:针对每一字符串,将该字符串的出现次数作为转移矩阵中该字符串对应的元素的值,得到转移矩阵。
在本申请的另一个实施例中,为了提高平滑度和防止过拟合,根据每一字符串的出现次数,构建转移矩阵,可以包括:针对每一字符串,计算该字符串的出现次数与预设初始值的和值,将计算得到的和值作为转移矩阵中该字符串对应的元素的值,得到转移矩阵。
在本申请的一个实施例中，转移矩阵中的一个元素对应一个图像单元格，确定输入文件对应的目标图像数据，即将转移矩阵中的每个元素的值转换成图像数据。具体的，根据转移矩阵中的元素，确定输入文件对应的目标图像数据，可以包括：根据转移矩阵中各元素的值，计算转移矩阵中各元素对应的图像单元格的颜色深度，得到输入文件对应的目标图像数据。至此，完成了“输入文件→字符/字符串→转移矩阵→图像数据”的转换过程。
上述图像单元格为图像处理的最小单元。颜色深度是指黑白图像中点的灰度值。本申请实施例中,将颜色深度作为图像单元格的值。
在本申请的一个实施例中,对于转移矩阵中的任一元素,可以采用以下方式确定各元素对应的图像单元格的颜色深度。具体的,根据转移矩阵中各元素的值,计算转移矩阵中各元素对应的图像单元格的颜色深度,可以包括:针对转移矩阵中的第一元素,确定第一元素的值为第一数值。其中,第一元素为转移矩阵中的任一元素,第一元素的值根据第一字符串的出现次数确定。第一字符串为转移矩阵中第一元素对应的字符串。
确定所有第二元素的值之和为第二数值。其中,第二元素的值根据第二字符串的出现次数确定,第二字符串的头部词与第一字符串的头部词相同。头部词即为第一个字符。
计算第一数值与第二数值的比值。
之后,根据计算得到的比值,确定转移矩阵中第一元素对应的图像单元格的颜色深度。
一种实现方式中,针对转移矩阵中的每一元素(例如第一元素),可以将计算得到的比值,作为转移矩阵中第一元素对应的图像单元格的颜色深度。
另一种实现方式中,针对转移矩阵中的每一元素(例如第一元素),可以根据以下公式确定第一元素的转移概率:
h=Log T;
其中,h为第一元素的转移概率,T为计算得到的比值,即第一数值与第二数值的比值。
将计算得到的第一元素的转移概率,确定为第一元素对应的图像单元格的颜色深度。
在本申请的一个实施例中,提取目标图像数据的特征,可以包括:将目标图像数据输入预先训练的CNN模型,得到目标图像数据的特征。
为了获得更为适用于文件识别的CNN模型,在本申请的一个实施例中,采用的CNN模型可以以经典CNN Lenet-5模型为基础,在经典CNN Lenet-5结构的基础上进行改进得到。其中,Lenet-5为一种经典的CNN网络架构,包括3个卷积层、2个池化层和2个全连接层。一种实现方式中,对Lenet-5结构的改进,如图5所示。
01、第一个卷积层包括32个卷积核,第二个卷积层包括64个卷积核。
02、第二个池化层后面增加0.25的DropOut层,第一个全连接层后面增加0.5的DropOut层。
在本申请的一个实施例中,可以采用DNN模型对目标图像数据的特征进行识别,也就是,采用DNN模型,利用目标图像数据的特征对待识别文件进行识别,确定输入文件是否为恶意文件。具体的,根据目标图像数据的特征,确定输入文件是否为恶意文件,可以包括:将目标图像数据的特征输入预先训练的DNN模型,得到输出结果;其中,DNN模型用于对图像数据的特征进行识别,确定图像数据对应的文件是否为恶意文件,输出结果指示输入文件是否为恶意文件。
例如,将目标图像数据的特征输入DNN模型,得到输入文件为安全文件的第一概率,以及输入文件为恶意文件的第二概率。若第一概率大于第二概率,则DNN模型的输出结果指示输入文件为安全文件。否则,DNN模型的输出结果指示输入文件为恶意文件。
本申请实施例中，为了提高文件识别的准确性，在对待识别文件进行识别前，可预先训练DNN模型和CNN模型。DNN模型和CNN模型的训练过程可参看图6所示实施例中601-609部分的描述说明。
在本申请的一个实施例中,可以采用恶意文件特征库对目标图像数据的特征进行识别,确定待识别文件是否为恶意文件。其中,恶意文件特征库包括:多个样本恶意文件对应的图像数据的特征。具体的,将目标图像数据输入CNN模型,获取CNN模型的预设层的输出结果作为目标图像数据的特征。从预设的恶意文件特征库中查找目标图像数据的特征。若查找到,则确定输入文件为恶意文件。若未查找到,则确定输入文件为安全文件。
为了进一步提高文件识别的准确性,提高文件识别的效率,一个可选的实施例中,在预先训练获得了CNN模型后,可以将样本恶意文件对应的图像数据输入CNN模型,获取CNN模型的预设层的输出结果,将CNN模型的预设层的输出结果作为样本恶意文件对应的图像数据的特征。由这多个样本恶意文件对应的图像数据的特征,构建恶意文件特征库。
可选的,为了避免图像数据的特征过长,增加文件识别的计算量,同时,为了避免图像数据的特征过短,降低文件识别的准确性,预设层可以为CNN模型的第三个卷积层,如图4所示。可选的,第三个卷积层输出的特征长度为512字节。
由于恶意文件特征库中的特征是从恶意文件中直接提取到的,若待识别文件的特征与恶意文件特征库中的特征匹配,可以确定待识别文件为恶意文件,提高了文件识别的准确性。另外,相较于DNN模型识别特征的计算量,匹配恶意文件特征库中的特征计算量要小很多,提高了文件识别的效率。
本申请实施例中,进行文件识别时,将待识别文件转换为图像数据,提取图像数据的特征,根据提取的特征,确定待识别文件是否为恶意文件。其中,图像数据的特征是待识别文件客观存在的特征,而不是根据经验设定的,依据此客观存在的特征得到文件识别结果,降低了文件识别对人的主观因素的依赖,提高了文件识别的准确性。
基于相同的发明构思,本申请实施例还提供了一种特征提取方法。参考图8,图8为本申请实施例提供的特征提取方法的一种流程示意图。该方法包括如下过程。
在801部分:将多个样本文件分别输入文件识别模型。
其中,文件识别模型用于:根据预设读取规则和预设词组模型,确定输入文件对应的多个字符串;根据输入文件对应的多个字符串,构建转移矩阵;根据文件的转移矩阵中的元素,确定输入文件对应的目标图像数据,其中,转移矩阵中的元素与字符串种类一一对应;利用CNN模型提取输入目标图像数据的特征,并利用DNN模型对目标图像数据的特征进行识别,确定输入文件是否为恶意文件。其中,字符串种类为字符串的种类,根据不同的读取规则和/或词组模型,获取的字符串的种类也不相同。
这里,输入文件为输入文件识别模型的文件。将多个样本文件分别输入文件识别模型时,这多个样本文件均为输入文件。
本申请实施例中,为了提高提取特征的准确性,在提取特征前训练DNN模型和CNN模型。DNN模型和CNN模型的训练过程可参看图6所示实施例中601-609部分的描述说明。
在802部分:针对每一样本文件,提取CNN模型的预设层的输出结果,作为该样本文件的特征。
在本申请的一个实施例中，根据预设读取规则和预设词组模型，确定输入文件对应的多个字符串，可以包括：按照预设读取规则读取输入文件，得到多个字符；按照预设词组模型，组合多个字符中相邻的字符，得到多个字符串。
其中，读取规则可以包括：2进制、8进制或16进制，但不限于这几种读取规则。预设词组模型可以包括BiGram模型和/或TriGram模型。
在本申请的一个实施例中,根据输入文件对应的多个字符串,构建转移矩阵,可以包括:确定每一字符串在多个字符串中的出现次数;根据每一字符串的出现次数,构建转移矩阵。可选的,转移矩阵的行数和列数相同,转移矩阵的行数和列数均为:字符串种类数与字符种类数的比值。其中,字符串种类数为:根据预设读取规则和预设词组模型确定字符串时,可获取到的字符串的种类数;字符种类数为:根据预设读取规则读取文件时,可获取到的字符的种类数。
例如,预设读取规则为16进制,预设词组模型可以包括BiGram模型和TriGram模型。按照转移矩阵的行数和列数相同,以及转移矩阵中的元素与字符串种类一一对应的规则,转移矩阵的行数和列数可以为272。也就是,可以根据输入文件对应的每一字符串的出现次数,构建272*272的转移矩阵。
在本申请的一个实施例中,根据每一字符串的出现次数,构建转移矩阵,可以包括:针对每一字符串,将该字符串的出现次数作为转移矩阵中该字符串对应的元素的值,得到转移矩阵。
在本申请的另一个实施例中,为了提高平滑度和防止过拟合,根据每一字符串的出现次数,构建转移矩阵,可以包括:针对每一字符串,计算该字符串的出现次数与预设初始值的和值,将计算得到的和值作为转移矩阵中该字符串对应的元素的值,得到转移矩阵。
在本申请的一个实施例中,转移矩阵中的一个元素对应一个图像单元格,确定输入文件对应的目标图像数据,即将转移矩阵中的每个元素的值转换成图像数据。具体的,根据转移矩阵中的元素,确定输入文件对应的目标图像数据,可以包括:根据转移矩阵中各元素的值,计算转移矩阵中各元素对应的图像单元格的颜色深度,得到输入文件对应的目标图像数据。至此,完成了“输入文件→字符/字符串→转移矩阵→图像数据”的转换过程。
上述图像单元格为图像处理的最小单元。颜色深度是指黑白图像中点的灰度值。本申请实施例中,将颜色深度作为图像单元格的值。
在本申请的一个实施例中,对于转移矩阵中的任一元素,可以采用以下方式确定各元素对应的图像单元格的颜色深度。具体的,根据转移矩阵中各元素的值,计算转移矩阵中各元素对应的图像单元格的颜色深度,可以包括:
针对转移矩阵中的第一元素,确定第一元素的值为第一数值。其中,第一元素为转移矩阵中的任一元素,第一元素的值根据第一字符串的出现次数确定。第一字符串为转移矩阵中第一元素对应的字符串。
确定所有第二元素的值之和为第二数值。其中,第二元素的值根据第二字符串的出现次数确定,第二字符串的头部词与第一字符串的头部词相同。这里,第二字符串中包括了第一字符串。头部词即为第一个字符。
计算第一数值与第二数值的比值。
之后,根据计算得到的比值,确定转移矩阵中第一元素对应的图像单元格的颜色深度。
一种实现方式中,针对转移矩阵中的每一元素(例如第一元素),可以将计算得到的比值,作为转移矩阵中第一元素对应的图像单元格的颜色深度。
另一种实现方式中,针对转移矩阵中的每一元素(例如第一元素),可以根据以下公式确定第一元素的转移概率:
h=Log T;
其中,h为第一元素的转移概率,T为计算得到的比值,即第一数值与第二数值的比值。
将计算得到的第一元素的转移概率确定为第一元素对应的图像单元格的颜色深度。
为了获得更为适用于文件识别的CNN模型,在本申请的一个实施例中,采用的CNN模型可以以经典CNN Lenet-5模型为基础,在经典CNN Lenet-5结构的基础上进行改进得到。其中,Lenet-5为一种经典的CNN网络架构,包括3个卷积层、2个池化层和2个全连接层。一种实现方式中,对Lenet-5结构的改进,如图5所示。
01、第一个卷积层包括32个卷积核,第二个卷积层包括64个卷积核。
02、第二个池化层后面增加0.25的DropOut层,第一个全连接层后面增加0.5的DropOut层。
在本申请的一个实施例中,样本文件为样本恶意文件。这种情况下,在提取文件识别模型中CNN模型的预设层的输出结果,作为样本文件的特征之后,还可以包括:根据提取的多个特征构建恶意文件特征库。
可选的,为了避免图像数据的特征过长,增加文件识别的计算量,同时,为了避免图像数据的特征过短,降低文件识别的准确性,预设层可以为CNN模型的第三个卷积层。可选的,第三个卷积层输出的特征长度为512字节。
在本申请的一个实施例中,可以采用恶意文件特征库对待识别文件进行识别,确定待识别文件是否为恶意文件。具体的,将待识别文件输入文件识别模型;获取文件识别模型中CNN模型的预设层的输出结果,作为目标特征;从恶意文件特征库中查找目标特征。若查找到,则确定待识别文件为恶意文件。若未查找到,则确定待识别文件为安全文件。
本申请实施例中,提取预先训练获得的识别模型中CNN模型的预设层输出的特征,不需要人工分析处理提取文件的特征,提高了特征提取的效率,降低了人工成本。
另外,基于提取的恶意文件的特征构建恶意文件特征库,基于恶意文件特征库对待识别文件进行识别。由于恶意文件特征库中包括的特征是从恶意文件中直接提取到的,若待识别文件的特征与恶意文件特征库中的特征匹配,可以确定待识别文件为恶意文件,提高了文件识别的准确性。另外,相较于DNN模型识别特征的计算量,匹配恶意文件特征库中的特征计算量要小很多,提高了文件识别的效率。
基于相同的发明构思,本申请实施例还提供了一种文件识别装置。参考图9,图9为本申请实施例提供的文件识别装置的第一种结构示意图,该装置包括:
获取模块901,用于获取待识别文件;
第一确定模块902,用于根据预设读取规则和预设词组模型,确定待识别文件对应的多个字符串;
构建模块903,用于根据多个字符串,构建转移矩阵;其中,转移矩阵中的元素与字符串种类一一对应;
第二确定模块904,用于根据转移矩阵中的元素,确定待识别文件对应的目标图像数据;
识别模块905,用于提取目标图像数据的特征,并根据目标图像数据的特征,确定待识别文件是否为恶意文件。
上述字符串种类为字符串的种类,根据不同的读取规则和/或词组模型,获取的字符串的种类也不相同。
在本申请的一个实施例中,第一确定模块902,具体可以用于:
按照预设读取规则读取待识别文件,得到多个字符;
按照预设词组模型,组合多个字符中相邻的字符,得到多个字符串。
在本申请的一个实施例中,构建模块903,具体可以用于:
确定每一字符串在多个字符串中的出现次数;
根据每一字符串的出现次数,构建转移矩阵。
在本申请的一个实施例中,构建模块903,具体可以用于:
针对每一字符串,将该字符串的出现次数作为转移矩阵中该字符串对应的元素的值,得到转移矩阵;或者,
针对每一字符串,计算该字符串的出现次数与预设初始值的和值,将计算得到的和值作为转移矩阵中该字符串对应的元素的值,得到转移矩阵。
在本申请的一个实施例中,第二确定模块904,具体可以用于:
根据转移矩阵中各元素的值，计算转移矩阵中各元素对应的图像单元格的颜色深度，得到待识别文件对应的目标图像数据。
在本申请的一个实施例中,第二确定模块904,具体可以用于:
针对转移矩阵中的第一元素,确定第一元素的值为第一数值;其中,第一元素为转移矩阵中的任一元素,第一元素的值根据第一字符串的出现次数确定;
确定所有第二元素的值之和为第二数值;其中,第二元素的值根据第二字符串的出现次数确定,第二字符串的头部词与第一字符串的头部词相同;
计算第一数值与第二数值的比值;
根据计算得到的比值,确定第一元素对应的图像单元格的颜色深度。
上述第一字符串为转移矩阵中第一元素对应的字符串。
在本申请的一个实施例中,第二确定模块904,具体可以用于:
针对第一元素,根据以下公式确定第一元素的转移概率:
h=Log T;
其中,h为第一元素的转移概率,T为计算得到的比值;
将计算得到的第一元素的转移概率，确定为第一元素对应的图像单元格的颜色深度。
在本申请的一个实施例中,识别模块905,具体可以用于:将目标图像数据输入预先训练的CNN模型,得到目标图像数据的特征;
其中,CNN模型以经典CNN Lenet-5模型为基础,第一个卷积层包括32个卷积核,第二个卷积层包括64个卷积核,第二个池化层后面增加0.25的DropOut层,第一个全连接层后面增加0.5的DropOut层。
在本申请的一个实施例中,识别模块905,具体可以用于:
将目标图像数据的特征输入预先训练的DNN模型,得到输出结果;其中,DNN模型用于利用图像数据的特征对文件进行识别,确定图像数据对应的文件是否为恶意文件,输出结果指示待识别文件是否为恶意文件。
在本申请的一个实施例中,目标图像数据的特征为CNN模型的预设层的输出结果;
此时,识别模块905,具体可以用于:
从预设恶意文件特征库中查找目标图像数据的特征;预设恶意文件特征库包括:多个样本恶意文件对应的图像数据的特征;
若查找到,则确定待识别文件为恶意文件;
若未查找到,则确定待识别文件为安全文件。
在一可选的实施例中,多个样本恶意文件对应的图像数据的特征的获取方式可以为:针对每一样本恶意文件,将该样本恶意文件对应的图像数据输入CNN模型,并将CNN模型的预设层对应输出的结果作为对应的图像数据的特征。
本申请实施例中,进行文件识别时,将待识别文件转换为图像数据,提取图像数据的特征,根据提取的特征,确定待识别文件是否为恶意文件。其中,图像数据的特征是待识别文件客观存在的特征,而不是根据经验设定的,依据此客观存在的特征得到文件识别结果,降低了文件识别对人的主观因素的依赖,提高了文件识别的准确性。
基于相同的发明构思,本申请实施例还提供了一种文件识别装置。参考图10,图10为本申请实施例提供的文件识别装置的第二种结构示意图,该装置包括:获取模块1001、输入模块1002和文件识别模型,文件识别模型包括:第一确定模块1003、构建模块1004、第二确定模块1005和识别模块1006;
获取模块1001,用于获取待识别文件;
输入模块1002,用于将待识别文件输入预先训练的文件识别模型;
第一确定模块1003,用于根据预设读取规则和预设词组模型,确定输入文件对应的多个字符串;
构建模块1004,用于根据输入文件对应的多个字符串,构建转移矩阵;转移矩阵中的元素与字符串种类一一对应;
第二确定模块1005,用于根据转移矩阵中的元素,确定输入文件对应的目标图像数据;
识别模块1006,用于提取目标图像数据的特征,并根据目标图像数据的特征,确定输入文件是否为恶意文件。
上述字符串种类为字符串的种类,根据不同的读取规则和/或词组模型,获取的字符串的种类也不相同。
在本申请的一个实施例中,第一确定模块1003,具体可以用于:
按照预设读取规则读取输入文件,得到多个字符;按照预设词组模型,组合多个字符中相邻的字符,得到多个字符串。
在本申请的一个实施例中,构建模块1004,具体可以用于:
确定每一字符串在多个字符串中的出现次数;根据每一字符串的出现次数,构建转移矩阵。
在本申请的一个实施例中,构建模块1004,具体可以用于:
针对每一字符串,将该字符串的出现次数作为转移矩阵中该字符串对应的元素的值,得到转移矩阵;或者,
针对每一字符串,计算该字符串的出现次数与预设初始值的和值,将计算得到的和值作为转移矩阵中该字符串对应的元素的值,得到转移矩阵。
在本申请的一个实施例中,第二确定模块1005,具体可以用于:
根据转移矩阵中各元素的值，计算转移矩阵中各元素对应的图像单元格的颜色深度，得到输入文件对应的目标图像数据。
在本申请的一个实施例中,第二确定模块1005,具体可以用于:
针对转移矩阵中的第一元素,确定第一元素的值为第一数值;其中,第一元素为转移矩阵中的任一元素,第一元素的值根据第一字符串的出现次数确定;
确定所有第二元素的值之和为第二数值;其中,第二元素的值根据第二字符串的出现次数确定,第二字符串的头部词与第一字符串的头部词相同;
计算第一数值与第二数值的比值;
根据计算得到的比值,确定第一元素对应的图像单元格的颜色深度。
上述第一字符串为转移矩阵中第一元素对应的字符串。
在本申请的一个实施例中,第二确定模块1005,具体可以用于:
针对第一元素,根据以下公式确定第一元素的转移概率:
h=Log T;
其中,h为第一元素的转移概率,T为计算得到的比值;
将计算得到的第一元素的转移概率确定为第一元素对应的图像单元格的颜色深度。
在本申请的一个实施例中,识别模块1006,具体可以用于:将目标图像数据输入预先训练的CNN模型,得到目标图像数据的特征;
其中,CNN模型以经典CNN Lenet-5模型为基础,第一个卷积层包括32个卷积核,第二个卷积层包括64个卷积核,第二个池化层后面增加0.25的DropOut层,第一个全连接层后面增加0.5的DropOut层。
在本申请的一个实施例中,识别模块1006,具体可以用于:
将目标图像数据的特征输入预先训练的DNN模型,得到输出结果;其中,DNN模型用于利用图像数据的特征对文件进行识别,确定图像数据对应的文件是否为恶意文件,输出结果指示输入文件是否为恶意文件。
在本申请的一个实施例中,目标图像数据的特征为CNN模型的预设层的输出结果;
此时,识别模块1006,具体可以用于:
从预设恶意文件特征库中查找目标图像数据的特征;预设恶意文件特征库包括:多个样本恶意文件对应的图像数据的特征;
若查找到,则确定输入文件为恶意文件;
若未查找到,则确定输入文件为安全文件。
在一可选的实施例中,多个样本恶意文件对应的图像数据的特征的获取方式可以为:针对每一样本恶意文件,将该样本恶意文件对应的图像数据输入CNN模型,并将CNN模型的预设层对应输出的结果作为对应的图像数据的特征。
本申请实施例中,进行文件识别时,将待识别文件转换为图像数据,提取图像数据的特征,根据提取的特征,确定待识别文件是否为恶意文件。其中,图像数据的特征是待识别文件客观存在的特征,而不是根据经验设定的,依据此客观存在的特征得到文件识别结果,降低了文件识别对人的主观因素的依赖,提高了文件识别的准确性。
基于相同的发明构思,本申请实施例还提供了一种特征提取装置。参考图11,图11为本申请实施例提供的特征提取装置的一种结构示意图,该装置包括:输入模块1101、提取模块1102和文件识别模型;文件识别模型包括第一确定模块1103、第一构建模块1104、第二确定模块1105和第一识别模块1106。
输入模块1101,用于将多个样本文件分别输入文件识别模型;
第一确定模块1103,用于根据预设读取规则和预设词组模型,确定输入文件对应的多个字符串;
第一构建模块1104，用于根据输入文件对应的多个字符串，构建转移矩阵；转移矩阵中的元素与字符串种类一一对应；
第二确定模块1105,用于根据转移矩阵中的元素,确定输入文件对应的目标图像数据;
第一识别模块1106,用于利用CNN模型提取输入目标图像数据的特征,并利用DNN模型对目标图像数据的特征进行识别,确定输入文件是否为恶意文件;
提取模块1102，用于针对每一样本文件，提取CNN模型的预设层的输出结果，作为该样本文件的特征。
上述字符串种类为字符串的种类,根据不同的读取规则和/或词组模型,获取的字符串的种类也不相同。
上述利用DNN模型对目标图像数据的特征进行识别,确定输入文件是否为恶意文件,即为DNN模型利用图像数据的特征对输入文件进行识别,确定输入文件是否为恶意文件。
在本申请的一个实施例中,第一确定模块1103,具体可以用于:
按照预设读取规则读取输入文件,得到多个字符;按照预设词组模型,组合多个字符中相邻的字符,得到多个字符串。
在本申请的一个实施例中,第一构建模块1104,具体可以用于:
确定每一字符串在多个字符串中的出现次数;根据每一字符串的出现次数,构建转移矩阵。
在本申请的一个实施例中,第一构建模块1104,具体可以用于:
针对每一字符串,将该字符串的出现次数作为转移矩阵中该字符串对应的元素的值,得到转移矩阵;或者,
针对每一字符串,计算该字符串的出现次数与预设初始值的和值,将计算得到的和值作为转移矩阵中该字符串对应的元素的值,得到转移矩阵。
在本申请的一个实施例中,第二确定模块1105,具体可以用于:
根据转移矩阵中各元素的值，计算转移矩阵中各元素对应的图像单元格的颜色深度，得到输入文件对应的目标图像数据。
在本申请的一个实施例中,第二确定模块1105,具体可以用于:
针对转移矩阵中的第一元素,确定第一元素的值为第一数值;其中,第一元素为转移矩阵中的任一元素,第一元素的值根据第一字符串的出现次数确定;
确定所有第二元素的值之和为第二数值;其中,第二元素的值根据第二字符串的出现次数确定,第二字符串的头部词与第一字符串的头部词相同;
计算第一数值与第二数值的比值;
根据计算得到的比值,确定第一元素对应的图像单元格的颜色深度。
上述第一字符串为转移矩阵中第一元素对应的字符串。
在本申请的一个实施例中,第二确定模块1105,具体可以用于:
针对第一元素,根据以下公式确定第一元素的转移概率:
h=Log T;
其中,h为第一元素的转移概率,T为计算得到的比值;
将计算得到的第一元素的转移概率确定为第一元素对应的图像单元格的颜色深度。
在本申请的一个实施例中，CNN模型以经典CNN Lenet-5模型为基础，第一个卷积层包括32个卷积核，第二个卷积层包括64个卷积核，第二个池化层后面增加0.25的DropOut层，第一个全连接层后面增加0.5的DropOut层。
在本申请的一个实施例中,样本文件为样本恶意文件;
上述特征提取装置还可以包括:第二构建模块,用于在针对每一样本文件,提取CNN模型的预设层的输出结果,作为该样本文件的特征之后,根据提取的多个特征构建恶意文件特征库。
在本申请的一个实施例中,上述特征提取装置还可以包括:第二识别模块,用于:
将待识别文件输入文件识别模型;获取文件识别模型中CNN模型的预设层的输出结果,作为目标特征;从恶意文件特征库中查找目标特征;若查找到,则确定待识别文件为恶意文件;若未查找到,则确定待识别文件为安全文件。
本申请实施例中,提取文件预先训练获得的识别模型中CNN模型的预设层输出的特征,不需要人工分析处理提取文件的特征,提高了特征提取的效率,降低了人工成本。
另外,基于提取的恶意文件的特征构建恶意文件特征库,基于恶意文件特征库对待识别文件进行识别。由于恶意文件特征库中包括的特征是从恶意文件中直接提取到的,若待识别文件的特征与恶意文件特征库中的特征匹配,可以确定待识别文件为恶意文件,提高了文件识别的准确性。另外,相较于DNN模型识别特征的计算量,匹配恶意文件特征库中的特征计算量要小很多,提高了文件识别的效率。
基于相同的发明构思,本申请实施例还提供了一种网络设备,如图12所示,包括处理器1201和机器可读存储介质1202,机器可读存储介质1202存储有能够被处理器1201执行的机器可执行指令。处理器1201被机器可执行指令促使实现上述图1所示的文件识别方法。具体的,处理器1201被机器可执行指令促使实现:
获取待识别文件;
根据预设读取规则和预设词组模型,确定待识别文件对应的多个字符串;
根据多个字符串,构建转移矩阵;其中,转移矩阵中的元素与字符串种类一一对应;
根据转移矩阵中的元素,确定待识别文件对应的目标图像数据;
提取目标图像数据的特征,并根据目标图像数据的特征,确定待识别文件是否为恶意文件。
上述字符串种类为字符串的种类,根据不同的读取规则和/或词组模型,获取的字符串的种类也不相同。
在本申请的一个实施例中,处理器1201被机器可执行指令促使具体可以实现:
按照预设读取规则读取待识别文件,得到多个字符;按照预设词组模型,组合多个字符中相邻的字符,得到多个字符串。
在本申请的一个实施例中,处理器1201被机器可执行指令促使具体可以实现:
确定每一字符串在多个字符串中的出现次数;根据每一字符串的出现次数,构建转移矩阵。
在本申请的一个实施例中,处理器1201被机器可执行指令促使具体可以实现:
针对每一字符串,将该字符串的出现次数作为转移矩阵中该字符串对应的元素的值,得到转移矩阵;或者,
针对每一字符串,计算该字符串的出现次数与预设初始值的和值,将计算得到的和值作为转移矩阵中该字符串对应的元素的值,得到转移矩阵。
在本申请的一个实施例中,处理器1201被机器可执行指令促使具体可以实现:
根据转移矩阵中各元素的值，计算转移矩阵中各元素对应的图像单元格的颜色深度，得到待识别文件对应的目标图像数据。
在本申请的一个实施例中,处理器1201被机器可执行指令促使具体可以实现:
针对转移矩阵中的第一元素,确定第一元素的值为第一数值;其中,第一元素为转移矩阵中的任一元素,第一元素的值根据第一字符串的出现次数确定;
确定所有第二元素的值之和为第二数值;其中,第二元素的值根据第二字符串的出现次数确定,第二字符串的头部词与第一字符串的头部词相同;
计算第一数值与第二数值的比值;
根据计算得到的比值,确定第一元素对应的图像单元格的颜色深度。
上述第一字符串为转移矩阵中第一元素对应的字符串。
在本申请的一个实施例中,处理器1201被机器可执行指令促使具体可以实现:
针对第一元素,根据以下公式确定第一元素的转移概率:
h=Log T;
其中,h为第一元素的转移概率,T为计算得到的比值;
将计算得到的第一元素的转移概率,确定为第一元素对应的图像单元格的颜色深度。
在本申请的一个实施例中,处理器1201被机器可执行指令促使具体可以实现:
将目标图像数据输入预先训练的CNN模型,得到目标图像数据的特征;其中,CNN模型以经典CNN Lenet-5模型为基础,第一个卷积层包括32个卷积核,第二个卷积层包括64个卷积核,第二个池化层后面增加0.25的DropOut层,第一个全连接层后面增加0.5的DropOut层。
在本申请的一个实施例中,处理器1201被机器可执行指令促使具体可以实现:
将目标图像数据的特征输入预先训练的DNN模型,得到输出结果;其中,DNN模型用于利用图像数据的特征对文件进行识别,确定图像数据对应的文件是否为恶意文件,输出结果指示待识别文件是否为恶意文件。
在本申请的一个实施例中,目标图像数据的特征为CNN模型的预设层的输出结果;
此时,处理器1201被机器可执行指令促使具体可以实现:
从预设恶意文件特征库中查找目标图像数据的特征;预设恶意文件特征库包括:多个样本恶意文件对应的图像数据的特征;
若查找到,则确定待识别文件为恶意文件;
若未查找到,则确定待识别文件为安全文件。
在一可选的实施例中,多个样本恶意文件对应的图像数据的特征的获取方式可以为:针对每一样本恶意文件,将该样本恶意文件对应的图像数据输入CNN模型,并将CNN模型的预设层对应输出的结果作为对应的图像数据的特征。
一个可选的实施例中,如图12所示,网络设备还可以包括:通信接口1203和通信总线1204;其中,处理器1201、机器可读存储介质1202、通信接口1203通过通信总线1204完成相互间的通信,通信接口1203用于上述网络设备与其他设备之间的通信。
本申请实施例中，进行文件识别时，将待识别文件转换为图像数据，提取图像数据的特征，根据提取的特征，确定待识别文件是否为恶意文件。其中，图像数据的特征是待识别文件客观存在的特征，而不是根据经验设定的，依据此客观存在的特征得到文件识别结果，降低了文件识别对人的主观因素的依赖，提高了文件识别的准确性。
基于相同的发明构思,本申请实施例还提供了一种网络设备,如图13所示,包括处理器1301和机器可读存储介质1302,机器可读存储介质1302存储有能够被处理器1301执行的机器可执行指令。处理器1301被机器可执行指令促使实现上述图7所示的文件识别方法。具体的,处理器1301被机器可执行指令促使实现:
获取待识别文件;
将待识别文件输入预先训练的文件识别模型,确定待识别文件是否为恶意文件;
其中,文件识别模型用于:根据预设读取规则和预设词组模型,确定输入文件对应的多个字符串;根据多个字符串,构建转移矩阵;转移矩阵中的元素与字符串种类一一对应;根据转移矩阵中的元素,确定输入文件对应的目标图像数据;提取目标图像数据的特征,根据目标图像数据的特征,确定输入文件是否为恶意文件。
上述字符串种类为字符串的种类,根据不同的读取规则和/或词组模型,获取的字符串的种类也不相同。
在本申请的一个实施例中,处理器1301被机器可执行指令促使具体可以实现:
按照预设读取规则读取输入文件,得到多个字符;按照预设词组模型,组合多个字符中相邻的字符,得到多个字符串。
在本申请的一个实施例中,处理器1301被机器可执行指令促使具体可以实现:
确定每一字符串在多个字符串中的出现次数;根据每一字符串的出现次数,构建转移矩阵。
在本申请的一个实施例中,处理器1301被机器可执行指令促使具体可以实现:
针对每一字符串,将该字符串的出现次数作为转移矩阵中该字符串对应的元素的值,得到转移矩阵;或者,
针对每一字符串,计算该字符串的出现次数与预设初始值的和值,将计算得到的和值作为转移矩阵中该字符串对应的元素的值,得到转移矩阵。
在本申请的一个实施例中,处理器1301被机器可执行指令促使具体可以实现:
根据转移矩阵中各元素的值，计算转移矩阵中各元素对应的图像单元格的颜色深度，得到输入文件对应的目标图像数据。
在本申请的一个实施例中,处理器1301被机器可执行指令促使具体可以实现:
针对转移矩阵中的第一元素,确定第一元素的值为第一数值;其中,第一元素为转移矩阵中的任一元素,第一元素的值根据第一字符串的出现次数确定;
确定所有第二元素的值之和为第二数值;其中,第二元素的值根据第二字符串的出现次数确定,第二字符串的头部词与第一字符串的头部词相同;
计算第一数值与第二数值的比值;
根据计算得到的比值,确定第一元素对应的图像单元格的颜色深度。
上述第一字符串为转移矩阵中第一元素对应的字符串。
在本申请的一个实施例中,处理器1301被机器可执行指令促使具体可以实现:
针对第一元素,根据以下公式确定第一元素的转移概率:
h=Log T;
其中,h为第一元素的转移概率,T为计算得到的比值;
将计算得到的第一元素的转移概率,确定为第一元素对应的图像单元格的颜色深度。
在本申请的一个实施例中,处理器1301被机器可执行指令促使具体可以实现:
将目标图像数据输入预先训练的CNN模型,得到目标图像数据的特征;
其中,CNN模型以经典CNN Lenet-5模型为基础,第一个卷积层包括32个卷积核,第二个卷积层包括64个卷积核,第二个池化层后面增加0.25的DropOut层,第一个全连接层后面增加0.5的DropOut层。
在本申请的一个实施例中,处理器1301被机器可执行指令促使具体可以实现:
将目标图像数据的特征输入预先训练的DNN模型,得到输出结果;其中,DNN模型用于利用图像数据的特征对文件进行识别,确定图像数据对应的文件是否为恶意文件,输出结果指示输入文件是否为恶意文件。
在本申请的一个实施例中,目标图像数据的特征为CNN模型的预设层的输出结果;
此时,处理器1301被机器可执行指令促使具体可以实现:
从预设恶意文件特征库中查找目标图像数据的特征;预设恶意文件特征库包括:多个样本恶意文件对应的图像数据的特征;
若查找到,则确定输入文件为恶意文件;
若未查找到,则确定输入文件为安全文件。
在一可选的实施例中,多个样本恶意文件对应的图像数据的特征的获取方式可以为:针对每一样本恶意文件,将该样本恶意文件对应的图像数据输入CNN模型,并将CNN模型的预设层对应输出的结果作为对应的图像数据的特征。
一个可选的实施例中,如图13所示,网络设备还可以包括:通信接口1303和通信总线1304;其中,处理器1301、机器可读存储介质1302、通信接口1303通过通信总线1304完成相互间的通信,通信接口1303用于上述网络设备与其他设备之间的通信。
本申请实施例中,进行文件识别时,将待识别文件转换为图像数据,提取图像数据的特征,根据提取的特征,确定待识别文件是否为恶意文件。其中,图像数据的特征是待识别文件客观存在的特征,而不是根据经验设定的,依据此客观存在的特征得到文件识别结果,降低了文件识别对人的主观因素的依赖,提高了文件识别的准确性。
基于相同的发明构思,本申请实施例还提供了一种网络设备,如图14所示,包括处理器1401和机器可读存储介质1402,机器可读存储介质1402存储有能够被处理器1401执行的机器可执行指令。处理器1401被机器可执行指令促使实现上述图8所示的特征提取方法。具体的,处理器1401被机器可执行指令促使实现:
将多个样本文件分别输入文件识别模型;其中,文件识别模型用于:根据预设读取规则和预设词组模型,确定输入文件对应的多个字符串;根据输入文件对应的多个字符串,构建转移矩阵,转移矩阵中的元素与字符串种类一一对应;根据转移矩阵中的元素,确定输入文件对应的目标图像数据;利用CNN模型提取输入文件对应的目标图像数据的特征,并利用DNN模型对目标图像数据的特征进行识别,确定输入文件是否为恶意文件;
针对每一样本文件,提取CNN模型的预设层的输出结果,作为该样本文件的特征。
上述字符串种类为字符串的种类,根据不同的读取规则和/或词组模型,获取的字符串的种类也不相同。
上述利用DNN模型对目标图像数据的特征进行识别,确定输入文件是否为恶意文件,即为DNN模型利用图像数据的特征对输入文件进行识别,确定输入文件是否为恶意文件。
在本申请的一个实施例中,处理器1401被机器可执行指令促使具体可以实现:
按照预设读取规则读取输入文件,得到多个字符;按照预设词组模型,组合多个字符中相邻的字符,得到多个字符串。
在本申请的一个实施例中,处理器1401被机器可执行指令促使具体可以实现:
确定每一字符串在多个字符串中的出现次数;根据每一字符串的出现次数,构建转移矩阵。
在本申请的一个实施例中,处理器1401被机器可执行指令促使具体可以实现:
针对每一字符串,将该字符串的出现次数作为转移矩阵中该字符串对应的元素的值,得到转移矩阵;或者,
针对每一字符串,计算该字符串的出现次数与预设初始值的和值,将计算得到的和值作为转移矩阵中该字符串对应的元素的值,得到转移矩阵。
在本申请的一个实施例中,处理器1401被机器可执行指令促使具体可以实现:
根据转移矩阵中各元素的值，计算转移矩阵中各元素对应的图像单元格的颜色深度，得到输入文件对应的目标图像数据。
在本申请的一个实施例中,处理器1401被机器可执行指令促使具体可以实现:
针对转移矩阵中的第一元素,确定第一元素的值为第一数值;其中,第一元素为转移矩阵中的任一元素,第一元素的值根据第一字符串的出现次数确定;
确定所有第二元素的值之和为第二数值;其中,第二元素的值根据第二字符串的出现次数确定,第二字符串的头部词与第一字符串的头部词相同;
计算第一数值与第二数值的比值;
根据计算得到的比值,确定第一元素对应的图像单元格的颜色深度。
上述第一字符串为转移矩阵中第一元素对应的字符串。
在本申请的一个实施例中,处理器1401被机器可执行指令促使具体可以实现:
针对第一元素,根据以下公式确定第一元素的转移概率:
h=Log T;
其中,h为第一元素的转移概率,T为计算得到的比值;
将计算得到的第一元素的转移概率,确定为第一元素对应的图像单元格的颜色深度。
在本申请的一个实施例中,CNN模型以经典CNN Lenet-5模型为基础,第一个卷积层包括32个卷积核,第二个卷积层包括64个卷积核,第二个池化层后面增加0.25的DropOut层,第一个全连接层后面增加0.5的DropOut层。
在本申请的一个实施例中,样本文件为样本恶意文件;
处理器1401被机器可执行指令促使还可以实现：在针对每一样本文件，提取CNN模型的预设层的输出结果，作为该样本文件的特征之后，根据提取的多个特征，构建恶意文件特征库。
在本申请的一个实施例中,机器可执行指令还可以包括:第二识别指令;
处理器1401被机器可执行指令促使还可以实现:将待识别文件输入文件识别模型;获取文件识别模型中CNN模型的预设层的输出结果,作为目标特征;从恶意文件特征库中查找目标特征;若查找到,则确定待识别文件为恶意文件;若未查找到,则确定待识别文件为安全文件。
一个可选的实施例中,如图14所示,网络设备还可以包括:通信接口1403和通信总线1404;其中,处理器1401、机器可读存储介质1402、通信接口1403通过通信总线1404完成相互间的通信,通信接口1403用于上述网络设备与其他设备之间的通信。
本申请实施例中,提取文件预先训练获得的识别模型中CNN模型的预设层输出的特征,不需要人工分析处理提取文件的特征,提高了特征提取的效率,降低了人工成本。
另外,基于提取的恶意文件的特征构建恶意文件特征库,基于恶意文件特征库对待识别文件进行识别。由于恶意文件特征库中包括的特征是从恶意文件中直接提取到的,若待识别文件的特征与恶意文件特征库中的特征匹配,可以确定待识别文件为恶意文件,提高了文件识别的准确性。另外,相较于DNN模型识别特征的计算量,匹配恶意文件特征库中的特征计算量要小很多,提高了文件识别的效率。
上述通信总线可以是外设部件互连标准(英文:Peripheral Component Interconnect,简称:PCI)总线或扩展工业标准结构(英文:Extended Industry Standard Architecture,简称:EISA)总线等。该通信总线可以分为地址总线、数据总线、控制总线等。
上述机器可读存储介质可以包括随机存取存储器(英文:Random Access Memory,简称:RAM),也可以包括非易失性存储器(英文:Non-Volatile Memory,简称:NVM),例如至少一个磁盘存储器。另外,机器可读存储介质还可以是至少一个位于远离前述处理器的存储装置。
上述处理器可以是通用处理器,包括中央处理器(英文:Central Processing Unit,简称:CPU)、网络处理器(英文:Network Processor,简称:NP)等;还可以是数字信号处理器(英文:Digital Signal Processing,简称:DSP)、专用集成电路(英文:Application Specific Integrated Circuit,简称:ASIC)、现场可编程门阵列(英文:Field-Programmable Gate Array,简称:FPGA)或其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。
基于相同的发明构思,本申请实施例还提供了一种机器可读存储介质,存储有机器可执行指令,在被处理器调用和执行时,机器可执行指令促使处理器实现上述图1所示的文件识别方法。具体的,机器可执行指令促使处理器实现:
获取待识别文件;
根据预设读取规则和预设词组模型,确定待识别文件对应的多个字符串;
根据多个字符串,构建转移矩阵;其中,转移矩阵中的元素与字符串种类一一对应;
根据转移矩阵中的元素,确定待识别文件对应的目标图像数据;
提取目标图像数据的特征,并根据目标图像数据的特征,确定待识别文件是否为恶意文件。
上述字符串种类为字符串的种类,根据不同的读取规则和/或词组模型,获取的字符串的种类也不相同。
本申请实施例中,进行文件识别时,将待识别文件转换为图像数据,提取图像数据的特征,根据提取的特征,确定待识别文件是否为恶意文件。其中,图像数据的特征是待识别文件客观存在的特征,而不是根据经验设定的,依据此客观存在的特征得到文件识别结果,降低了文件识别对人的主观因素的依赖,提高了文件识别的准确性。
基于相同的发明构思,本申请实施例还提供了一种机器可读存储介质,存储有机器可执行指令,在被处理器调用和执行时,机器可执行指令促使处理器实现上述图7所示的文件识别方法。具体的,机器可执行指令促使处理器实现:
获取待识别文件;
将待识别文件输入预先训练的文件识别模型,确定待识别文件是否为恶意文件;
其中,文件识别模型用于:根据预设读取规则和预设词组模型,确定输入文件对应的多个字符串;根据多个字符串,构建转移矩阵;转移矩阵中的元素与字符串种类一一对应;根据转移矩阵中的元素,确定输入文件对应的目标图像数据;提取目标图像数据的特征,根据目标图像数据的特征,确定输入文件是否为恶意文件。
上述字符串种类为字符串的种类,根据不同的读取规则和/或词组模型,获取的字符串的种类也不相同。
本申请实施例中,进行文件识别时,将待识别文件转换为图像数据,提取图像数据的特征,根据提取的特征,确定待识别文件是否为恶意文件。其中,图像数据的特征是待识别文件客观存在的特征,而不是根据经验设定的,依据此客观存在的特征得到文件识别结果,降低了文件识别对人的主观因素的依赖,提高了文件识别的准确性。
基于相同的发明构思,本申请实施例还提供了一种机器可读存储介质,存储有机器可执行指令,在被处理器调用和执行时,机器可执行指令促使处理器实现上述图8所示的特征提取方法。具体的,机器可执行指令促使处理器实现:
将多个样本文件分别输入文件识别模型;其中,文件识别模型用于:根据预设读取规则和预设词组模型,确定输入文件对应的多个字符串;根据输入文件对应的多个字符串,构建转移矩阵,转移矩阵中的元素与字符串种类一一对应;根据转移矩阵中的元素,确定输入文件对应的目标图像数据;利用CNN模型提取输入文件对应的目标图像数据的特征,并利用DNN模型对目标图像数据的特征进行识别,确定输入文件是否为恶意文件;
针对每一样本文件,提取CNN模型的预设层的输出结果,作为该样本文件的特征。
上述字符串种类为字符串的种类,根据不同的读取规则和/或词组模型,获取的字符串的种类也不相同。
本申请实施例中,提取文件预先训练获得的识别模型中CNN模型的预设层输出的特征,不需要人工分析处理提取文件的特征,提高了特征提取的效率,降低了人工成本。
另外,基于提取的恶意文件的特征构建恶意文件特征库,基于恶意文件特征库对待识别文件进行识别。由于恶意文件特征库中包括的特征是从恶意文件中直接提取到的,若待识别文件的特征与恶意文件特征库中的特征匹配,可以确定待识别文件为恶意文件,提高了文件识别的准确性。另外,相较于DNN模型识别特征的计算量,匹配恶意文件特征库中的特征计算量要小很多,提高了文件识别的效率。
基于相同的发明构思,本申请实施例还提供了一种机器可执行指令,在被处理器调用和执行时,机器可执行指令促使处理器实现上述图1所示的文件识别方法。具体的,机器可执行指令促使处理器实现:
获取待识别文件;
根据预设读取规则和预设词组模型,确定待识别文件对应的多个字符串;
根据多个字符串,构建转移矩阵;其中,转移矩阵中的元素与字符串种类一一对应;
根据转移矩阵中的元素,确定待识别文件对应的目标图像数据;
提取目标图像数据的特征,并根据目标图像数据的特征,确定待识别文件是否为恶意文件。
上述字符串种类为字符串的种类,根据不同的读取规则和/或词组模型,获取的字符串的种类也不相同。
本申请实施例中,进行文件识别时,将待识别文件转换为图像数据,提取图像数据的特征,根据提取的特征,确定待识别文件是否为恶意文件。其中,图像数据的特征是待识别文件客观存在的特征,而不是根据经验设定的,依据此客观存在的特征得到文件识别结果,降低了文件识别对人的主观因素的依赖,提高了文件识别的准确性。
基于相同的发明构思，本申请实施例还提供了一种机器可执行指令，在被处理器调用和执行时，机器可执行指令促使处理器实现上述图7所示的文件识别方法。具体的，机器可执行指令促使处理器实现：
获取待识别文件；
将待识别文件输入预先训练的文件识别模型，确定待识别文件是否为恶意文件；
其中,文件识别模型用于:根据预设读取规则和预设词组模型,确定输入文件对应的多个字符串;根据多个字符串,构建转移矩阵;转移矩阵中的元素与字符串种类一一对应;根据转移矩阵中的元素,确定输入文件对应的目标图像数据;提取目标图像数据的特征,根据目标图像数据的特征,确定输入文件是否为恶意文件。
上述字符串种类为字符串的种类,根据不同的读取规则和/或词组模型,获取的字符串的种类也不相同。
本申请实施例中,进行文件识别时,将待识别文件转换为图像数据,提取图像数据的特征,根据提取的特征,确定待识别文件是否为恶意文件。其中,图像数据的特征是待识别文件客观存在的特征,而不是根据经验设定的,依据此客观存在的特征得到文件识别结果,降低了文件识别对人的主观因素的依赖,提高了文件识别的准确性。
基于相同的发明构思,本申请实施例还提供了一种机器可执行指令,在被处理器调用和执行时,机器可执行指令促使处理器实现上述图8所示的特征提取方法。具体的,机器可执行指令促使处理器实现:
将多个样本文件分别输入文件识别模型;其中,文件识别模型用于:根据预设读取规则和预设词组模型,确定输入文件对应的多个字符串;根据输入文件对应的多个字符串,构建转移矩阵,转移矩阵中的元素与字符串种类一一对应;根据转移矩阵中的元素,确定输入文件对应的目标图像数据;利用CNN模型提取输入文件对应的目标图像数据的特征,并利用DNN模型对目标图像数据的特征进行识别,确定输入文件是否为恶意文件;
针对每一样本文件,提取CNN模型的预设层的输出结果,作为该样本文件的特征。
上述字符串种类为字符串的种类,根据不同的读取规则和/或词组模型,获取的字符串的种类也不相同。
本申请实施例中,提取文件预先训练获得的识别模型中CNN模型的预设层输出的特征,不需要人工分析处理提取文件的特征,提高了特征提取的效率,降低了人工成本。
另外，基于提取的恶意文件的特征构建恶意文件特征库，基于恶意文件特征库对待识别文件进行识别。由于恶意文件特征库中包括的特征是从恶意文件中直接提取到的，若待识别文件的特征与恶意文件特征库中的特征匹配，可以确定待识别文件为恶意文件，提高了文件识别的准确性。另外，相较于DNN模型识别特征的计算量，匹配恶意文件特征库中的特征计算量要小很多，提高了文件识别的效率。
需要说明的是,在本文中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
本说明书中的各个实施例均采用相关的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于文件识别装置、特征提取装置、网络设备、机器可读存储介质实施例而言,由于其基本相似于文件识别方法和特征提取方法实施例,所以描述的比较简单,相关之处参见文件识别方法和特征提取方法实施例的部分说明即可。
以上所述仅为本申请的较佳实施例而已,并非用于限定本申请的保护范围。凡在本申请的精神和原则之内所作的任何修改、等同替换、改进等,均包含在本申请的保护范围内。

Claims (28)

  1. 一种文件识别方法,所述方法包括:
    获取待识别文件;
    根据预设读取规则和预设词组模型,确定所述待识别文件对应的多个字符串;
    根据所述多个字符串,构建转移矩阵;其中,所述转移矩阵中的元素与字符串种类一一对应;
    根据所述转移矩阵中的元素,确定所述待识别文件对应的目标图像数据;
    提取所述目标图像数据的特征,并根据所述目标图像数据的特征,确定所述待识别文件是否为恶意文件。
  2. 根据权利要求1所述的方法,所述根据预设读取规则和预设词组模型,确定所述待识别文件对应的多个字符串,包括:
    按照预设读取规则读取所述待识别文件,得到多个字符;
    按照预设词组模型,组合所述多个字符中相邻的字符,得到多个字符串。
  3. 根据权利要求1所述的方法,所述根据所述多个字符串,构建转移矩阵,包括:
    确定每一字符串在所述多个字符串中的出现次数;
    根据每一字符串的出现次数,构建转移矩阵。
  4. 根据权利要求3所述的方法,所述根据每一字符串的出现次数,构建转移矩阵,包括:
    针对每一字符串,将该字符串的出现次数作为转移矩阵中该字符串对应的元素的值,得到所述转移矩阵;或者,
    针对每一字符串,计算该字符串的出现次数与预设初始值的和值,将计算得到的和值作为转移矩阵中该字符串对应的元素的值,得到所述转移矩阵。
  5. 根据权利要求1所述的方法,所述根据所述转移矩阵中的元素,确定所述待识别文件对应的目标图像数据,包括:
    根据所述转移矩阵中各元素的值,计算所述转移矩阵中各元素对应的图像单元格的颜色深度,得到所述待识别文件对应的目标图像数据。
  6. 根据权利要求5所述的方法,所述根据所述转移矩阵中各元素的值,计算所述转移矩阵中各元素对应的图像单元格的颜色深度,包括:
    针对所述转移矩阵中的第一元素,确定所述第一元素的值为第一数值;其中,所述第一元素为所述转移矩阵中的任一元素,所述第一元素的值根据所述第一元素对应的第一字符串的出现次数确定;
    确定所有第二元素的值之和为第二数值;其中,所述第二元素的值根据第二字符串的出现次数确定,所述第二字符串的头部词与所述第一字符串的头部词相同;
    计算所述第一数值与所述第二数值的比值;
    根据计算得到的比值,确定所述第一元素对应的图像单元格的颜色深度。
  7. 根据权利要求6所述的方法,所述根据计算得到的比值,确定所述第一元素对应的图像单元格的颜色深度,包括:
    针对所述第一元素,根据以下公式确定所述第一元素的转移概率:
    h=Log T;
    其中,h为所述第一元素的转移概率,T为计算得到的比值;
    将计算得到的所述第一元素的转移概率,确定为所述第一元素对应的图像单元格的颜色深度。
  8. 根据权利要求1所述的方法,所述提取所述目标图像数据的特征,包括:
    将所述目标图像数据输入预先训练的卷积神经网络CNN模型,得到所述目标图像数据的特征;
    其中,所述CNN模型以经典CNN Lenet-5模型为基础,第一个卷积层包括32个卷积核,第二个卷积层包括64个卷积核,第二个池化层后面增加0.25的丢弃DropOut层,第一个全连接层后面增加0.5的DropOut层。
  9. 根据权利要求8所述的方法,所述根据所述目标图像数据的特征,确定所述待识别文件是否为恶意文件,包括:
    将所述目标图像数据的特征输入预先训练的深度神经网络DNN模型,得到输出结果;其中,所述DNN模型用于对图像数据的特征进行识别,确定图像数据对应的文件是否为恶意文件,所述输出结果指示所述待识别文件是否为恶意文件。
  10. 根据权利要求8所述的方法,所述目标图像数据的特征为所述CNN模型的预设层的输出结果;
    所述根据所述目标图像数据的特征,确定所述待识别文件是否为恶意文件,包括:
    从预设恶意文件特征库中查找所述目标图像数据的特征;所述预设恶意文件特征库包括:多个样本恶意文件对应的图像数据的特征;
    若查找到,则确定所述待识别文件为恶意文件;
    若未查找到,则确定所述待识别文件为安全文件。
  11. 一种文件识别方法,所述方法包括:
    获取待识别文件;
    将所述待识别文件输入预先训练的文件识别模型,确定所述待识别文件是否为恶意文件;
    其中,所述文件识别模型用于:根据预设读取规则和预设词组模型,确定输入文件对应的多个字符串;根据所述多个字符串,构建转移矩阵,所述转移矩阵中的元素与字符串种类一一对应;根据所述转移矩阵中的元素,确定所述输入文件对应的目标图像数据;提取所述目标图像数据的特征,并根据所述目标图像数据的特征,确定所述输入文件是否为恶意文件。
  12. 一种特征提取方法,所述方法包括:
    将多个样本文件分别输入文件识别模型;其中,所述文件识别模型用于:根据预设读取规则和预设词组模型,确定输入文件对应的多个字符串;根据所述多个字符串,构建转移矩阵,所述转移矩阵中的元素与字符串种类一一对应;根据所述转移矩阵中的元素,确定所述输入文件对应的目标图像数据;利用卷积神经网络CNN模型提取所述目标图像数据的特征;利用深度神经网络DNN模型对所述目标图像数据的特征进行识别,确定所述输入文件是否为恶意文件;
    针对每一样本文件,提取所述CNN模型的预设层的输出结果,作为该样本文件的特征。
  13. 根据权利要求12所述的方法,所述样本文件为样本恶意文件;
    在针对每一样本文件,提取所述CNN模型的预设层的输出结果,作为该样本文件的特征之后,还包括:
    根据提取的多个特征,构建恶意文件特征库。
  14. 根据权利要求13所述的方法,还包括:
    将待识别文件输入所述文件识别模型;
    获取所述CNN模型的预设层的输出结果,作为目标特征;
    从所述恶意文件特征库中查找所述目标特征;
    若查找到,则确定所述待识别文件为恶意文件;
    若未查找到,则确定所述待识别文件为安全文件。
  15. 一种网络设备,包括处理器和机器可读存储介质,所述机器可读存储介质存储有能够被所述处理器执行的机器可执行指令,所述处理器被所述机器可执行指令促使实现:
    获取待识别文件;
    根据预设读取规则和预设词组模型,确定所述待识别文件对应的多个字符串;
    根据所述多个字符串,构建转移矩阵;其中,所述转移矩阵中的元素与字符串种类一一对应;
    根据所述转移矩阵中的元素,确定所述待识别文件对应的目标图像数据;
    提取所述目标图像数据的特征,并根据所述目标图像数据的特征,确定所述待识别文件是否为恶意文件。
  16. 根据权利要求15所述的网络设备,所述处理器被所述机器可执行指令促使具体实现:
    按照预设读取规则读取所述待识别文件,得到多个字符;
    按照预设词组模型,组合所述多个字符中相邻的字符,得到多个字符串。
  17. 根据权利要求15所述的网络设备,所述处理器被所述机器可执行指令促使具体实现:
    确定每一字符串在所述多个字符串中的出现次数;
    根据每一字符串的出现次数,构建转移矩阵。
  18. 根据权利要求17所述的网络设备,所述处理器被所述机器可执行指令促使具体实现:
    针对每一字符串,将该字符串的出现次数作为转移矩阵中该字符串对应的元素的值,得到所述转移矩阵;或者,
    针对每一字符串,计算该字符串的出现次数与预设初始值的和值,将计算得到的和值作为转移矩阵中该字符串对应的元素的值,得到所述转移矩阵。
  19. 根据权利要求15所述的网络设备,所述处理器被所述机器可执行指令促使具体实现:
    根据所述转移矩阵中各元素的值,计算所述转移矩阵中各元素对应的图像单元格的颜色深度,得到所述待识别文件对应的目标图像数据。
  20. 根据权利要求19所述的网络设备,所述处理器被所述机器可执行指令促使具体实现:
    针对所述转移矩阵中的第一元素,确定所述第一元素的值为第一数值;其中,所述第一元素为所述转移矩阵中的任一元素,所述第一元素的值根据所述第一元素对应的第一字符串的出现次数确定;
    确定所有第二元素的值之和为第二数值;其中,所述第二元素的值根据第二字符串的出现次数确定,所述第二字符串的头部词与所述第一字符串的头部词相同;
    计算所述第一数值与所述第二数值的比值;
    根据计算得到的比值,确定所述第一元素对应的图像单元格的颜色深度。
  21. 根据权利要求20所述的网络设备,所述处理器被所述机器可执行指令促使具体实现:
    针对所述第一元素,根据以下公式确定所述第一元素的转移概率:
    h=Log T;
    其中,h为所述第一元素的转移概率,T为计算得到的比值;
    将计算得到的所述第一元素的转移概率,确定为所述第一元素对应的图像单元格的颜色深度。
  22. 根据权利要求15所述的网络设备,所述处理器被所述机器可执行指令促使具体实现:
    将所述目标图像数据输入预先训练的卷积神经网络CNN模型,得到所述目标图像数据的特征;
    其中,所述CNN模型以经典CNN Lenet-5模型为基础,第一个卷积层包括32个卷积核,第二个卷积层包括64个卷积核,第二个池化层后面增加0.25的丢弃DropOut层,第一个全连接层后面增加0.5的DropOut层。
  23. 根据权利要求22所述的网络设备,所述处理器被所述机器可执行指令促使具体实现:
    将所述目标图像数据的特征输入预先训练的深度神经网络DNN模型,得到输出结果;其中,所述DNN模型用于对图像数据的特征进行识别,确定图像数据对应的文件是否为恶意文件,所述输出结果指示所述待识别文件是否为恶意文件。
  24. 根据权利要求22所述的网络设备,所述目标图像数据的特征为所述CNN模型的预设层的输出结果;
    所述处理器被所述机器可执行指令促使具体实现:所述根据所述目标图像数据的特征,确定所述待识别文件是否为恶意文件,包括:从预设恶意文件特征库中查找所述目标图像数据的特征;所述预设恶意文件特征库包括:多个样本恶意文件对应的图像数据的特征;若查找到,则确定所述待识别文件为恶意文件;
    若未查找到,则确定所述待识别文件为安全文件。
  25. 一种网络设备,包括处理器和机器可读存储介质,所述机器可读存储介质存储有能够被所述处理器执行的机器可执行指令,所述处理器被所述机器可执行指令促使实现:
    获取待识别文件;
    将所述待识别文件输入预先训练的文件识别模型,确定所述待识别文件是否为恶意文件;
    其中,所述文件识别模型用于:根据预设读取规则和预设词组模型,确定输入文件对应的多个字符串;根据所述多个字符串,构建转移矩阵,所述转移矩阵中的元素与字符串种类一一对应;根据所述转移矩阵中的元素,确定所述输入文件对应的目标图像数据;提取所述目标图像数据的特征,并根据所述目标图像数据的特征,确定所述输入文件是否为恶意文件。
  26. 一种网络设备,包括处理器和机器可读存储介质,所述机器可读存储介质存储有能够被所述处理器执行的机器可执行指令,所述处理器被所述机器可执行指令促使实现:
    将多个样本文件分别输入文件识别模型;其中,所述文件识别模型用于:根据预设读取规则和预设词组模型,确定输入文件对应的多个字符串;根据所述多个字符串,构建转移矩阵,所述转移矩阵中的元素与字符串种类一一对应;根据所述转移矩阵中的元素,确定所述输入文件对应的目标图像数据;利用卷积神经网络CNN模型提取所述目标图像数据的特征,并利用深度神经网络DNN模型对所述目标图像数据的特征进行识别,确定所述输入文件是否为恶意文件;
    针对每一样本文件,提取所述CNN模型的预设层的输出结果,作为该样本文件的特征。
  27. 根据权利要求26所述的网络设备,所述样本文件为样本恶意文件;
    所述处理器被所述机器可执行指令促使实现:在针对每一样本文件,提取所述CNN模型的预设层的输出结果,作为该样本文件的特征之后,根据提取的多个特征,构建恶意文件特征库。
  28. 根据权利要求26所述的网络设备，所述处理器被所述机器可执行指令促使实现：将待识别文件输入所述文件识别模型；获取所述CNN模型的预设层的输出结果，作为目标特征；从所述恶意文件特征库中查找所述目标特征；若查找到，则确定所述待识别文件为恶意文件；若未查找到，则确定所述待识别文件为安全文件。
PCT/CN2019/083200 2018-04-18 2019-04-18 文件识别方法和特征提取方法 WO2019201295A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810349458.6 2018-04-18
CN201810349458.6A CN109753987B (zh) 2018-04-18 2018-04-18 文件识别方法和特征提取方法

Publications (1)

Publication Number Publication Date
WO2019201295A1 true WO2019201295A1 (zh) 2019-10-24

Family

ID=66402373

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/083200 WO2019201295A1 (zh) 2018-04-18 2019-04-18 文件识别方法和特征提取方法

Country Status (2)

Country Link
CN (1) CN109753987B (zh)
WO (1) WO2019201295A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079528A (zh) * 2019-11-07 2020-04-28 国网辽宁省电力有限公司电力科学研究院 Deep-learning-based primitive drawing checking method and system
CN111582282A (zh) * 2020-05-13 2020-08-25 科大讯飞股份有限公司 Text recognition method, apparatus, device and storage medium
CN113949582A (zh) * 2021-10-25 2022-01-18 绿盟科技集团股份有限公司 Network asset identification method, apparatus, electronic device and storage medium

Families Citing this family (3)

Publication number Priority date Publication date Assignee Title
CN110516125B (zh) * 2019-08-28 2020-05-08 拉扎斯网络科技(上海)有限公司 Method, apparatus and device for identifying abnormal strings, and readable storage medium
CN111222856A (zh) * 2020-01-15 2020-06-02 深信服科技股份有限公司 Email identification method, apparatus, device and storage medium
CN111310205B (zh) * 2020-02-11 2024-05-10 平安科技(深圳)有限公司 Sensitive information detection method and apparatus, computer device and storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
CN101046858A (zh) * 2006-03-29 2007-10-03 腾讯科技(深圳)有限公司 Electronic information comparison system and method, and anti-spam system
CN104216875A (zh) * 2014-09-26 2014-12-17 中国科学院自动化研究所 Automatic microblog text summarization method based on unsupervised key bigram string extraction
CN107392019A (zh) * 2017-07-05 2017-11-24 北京金睛云华科技有限公司 Method and apparatus for training and detecting malicious code families

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US8499354B1 (en) * 2011-03-15 2013-07-30 Symantec Corporation Preventing malware from abusing application data
CN104751055B (zh) * 2013-12-31 2017-11-03 北京启明星辰信息安全技术有限公司 Texture-based distributed malicious code detection method, apparatus and system
CN105095755A (zh) * 2015-06-15 2015-11-25 安一恒通(北京)科技有限公司 File identification method and apparatus


Cited By (5)

Publication number Priority date Publication date Assignee Title
CN111079528A (zh) * 2019-11-07 2020-04-28 国网辽宁省电力有限公司电力科学研究院 Deep-learning-based primitive drawing checking method and system
CN111582282A (zh) * 2020-05-13 2020-08-25 科大讯飞股份有限公司 Text recognition method, apparatus, device and storage medium
CN111582282B (zh) * 2020-05-13 2024-04-12 科大讯飞股份有限公司 Text recognition method, apparatus, device and storage medium
CN113949582A (zh) * 2021-10-25 2022-01-18 绿盟科技集团股份有限公司 Network asset identification method, apparatus, electronic device and storage medium
CN113949582B (zh) * 2021-10-25 2023-05-30 绿盟科技集团股份有限公司 Network asset identification method, apparatus, electronic device and storage medium

Also Published As

Publication number Publication date
CN109753987B (zh) 2021-08-06
CN109753987A (zh) 2019-05-14

Similar Documents

Publication Publication Date Title
WO2019201295A1 (zh) File identification method and feature extraction method
WO2020221298A1 (zh) Text detection model training method, and text region and content determination method and apparatus
TWI682325B (zh) Identification system and identification method
JP6541673B2 (ja) Real-time speech evaluation system and method on a mobile device
WO2020073664A1 (zh) Coreference resolution method, electronic apparatus and computer-readable storage medium
CN108647736B (zh) Image classification method based on perceptual loss and matching attention mechanism
US20110314294A1 (en) Password checking
CN104156349B (zh) Unknown-word discovery and word segmentation system and method based on a statistical dictionary model
JP2020505643A (ja) Speech recognition method, electronic device, and computer storage medium
WO2021208727A1 (zh) Artificial-intelligence-based text error detection method and apparatus, and computer device
WO2021031825A1 (zh) Network fraud identification method and apparatus, computer device and storage medium
CN111835763B (zh) DNS tunnel traffic detection method and apparatus, and electronic device
CN115380284A (zh) Unstructured text classification
CN114050912B (zh) Malicious domain name detection method and apparatus based on deep reinforcement learning
WO2019238125A1 (zh) Information processing method, related device, and computer storage medium
CN116956835B (zh) Document generation method based on a pre-trained language model
WO2019201024A1 (zh) Method, apparatus, device and storage medium for updating model parameters
WO2021082861A1 (zh) Scoring method and apparatus, electronic device and storage medium
US11227110B1 (en) Transliteration of text entry across scripts
JP7149976B2 (ja) Error correction method and apparatus, and computer-readable medium
US20160063336A1 (en) Generating Weights for Biometric Tokens in Probabilistic Matching Systems
WO2021253938A1 (zh) Neural network training method, and video recognition method and apparatus
CN114826681A (zh) DGA domain name detection method, system, medium, device and terminal
CN113239683A (zh) Chinese text error correction method, system and medium
CN115858776B (zh) Variant text classification and recognition method, system, storage medium and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19787885

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19787885

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 04.05.2021)
