WO2019201295A1 - File identification method and feature extraction method - Google Patents
File identification method and feature extraction method
- Publication number: WO2019201295A1 (PCT/CN2019/083200)
- Authority: WIPO (PCT)
- Prior art keywords: file, image data, identified, transfer matrix, feature
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
Definitions
- Malicious code is one form of attack employed by an attacker.
- A file carrying malicious code is a malicious file; that is, a malicious file is one form in which an attacker carries out an attack.
- Malicious files exploit network service vulnerabilities to attack network servers for the purpose of stealing information and services.
- The process of file identification includes: obtaining a file to be identified, running the file to be identified in a sandbox, extracting the running features of the file to be identified, normalizing the extracted running features, and inputting the normalized features into a deep neural network (English: Deep Neural Network, DNN for short) model.
- The DNN model is trained using the running features of files.
- FIG. 1 is a schematic diagram of a first process of a file identification method according to an embodiment of the present application
- FIG. 2 is a first schematic diagram of a transfer matrix provided by an embodiment of the present application.
- FIG. 3 is a second schematic diagram of a transfer matrix provided by an embodiment of the present application.
- FIG. 4 is a schematic diagram of image data based on the transfer matrix shown in FIG. 3;
- FIG. 5 is a schematic structural diagram of a convolutional neural network model according to an embodiment of the present application.
- FIG. 6 is a schematic flowchart of a model training method according to an embodiment of the present application.
- FIG. 7 is a second schematic flowchart of a file identification method according to an embodiment of the present disclosure.
- FIG. 8 is a schematic flowchart of a feature extraction method according to an embodiment of the present disclosure.
- FIG. 9 is a schematic diagram of a first structure of a file identification apparatus according to an embodiment of the present application.
- FIG. 10 is a schematic diagram of a second structure of a file identification apparatus according to an embodiment of the present disclosure.
- FIG. 11 is a schematic structural diagram of a feature extraction apparatus according to an embodiment of the present application.
- FIG. 12 is a schematic diagram of a first structure of a network device according to an embodiment of the present disclosure.
- FIG. 13 is a schematic diagram of a second structure of a network device according to an embodiment of the present disclosure.
- FIG. 14 is a schematic diagram of a third structure of a network device according to an embodiment of the present disclosure.
- The running features of the file to be identified that are extracted in the sandbox are set by the user according to experience; that is, file recognition depends on subjective human factors, and the accuracy of file recognition is therefore low.
- the embodiment of the present application provides a file identification method.
- the file identification method can be applied to network devices such as firewall devices, routers, switches, and the like.
- the method can also be performed by a file identification device, which can be implemented in hardware and/or software, and can generally be integrated into a network device for file identification.
- the file identification method provided by the embodiment of the present application converts the file to be identified into image data, extracts features of the image data, and then determines whether the file to be identified is a malicious file according to the extracted feature.
- The features of the image data exist objectively in the file to be identified, rather than being set according to experience. Obtaining a file recognition result from these objectively existing features reduces the dependence of file recognition on subjective human factors and improves the accuracy of the identification. Therefore, the file identification method provided by the embodiment of the present application is more accurate.
- FIG. 1 is a schematic diagram of a first process of a file identification method according to an embodiment of the present application, where the method includes the following process.
- the execution subject of the file identification method is a network device as an example.
- the file to be identified obtained by the network device may be: a file sent by another network device to the network device.
- the file to be identified obtained by the network device may also be: a file obtained from a locally stored file.
- Section 102: determining a plurality of character strings corresponding to the file to be identified according to the preset reading rule and the preset phrase model.
- Determining a plurality of character strings corresponding to the file to be identified according to the preset reading rule and the preset phrase model may include: reading the file to be identified according to the preset reading rule to obtain a plurality of characters, and combining adjacent characters of the plurality of characters according to the preset phrase model to obtain a plurality of character strings.
- the reading rule may include: binary, octal, or hexadecimal, but is not limited to these types of reading rules.
- the preset phrase model may include a binary phrase (English: BiGram) model and/or a ternary word (English: TriGram) model.
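The reading-and-combining step above can be sketched as follows. This is a minimal illustration assuming hexadecimal reading; the function names are illustrative, not from the patent:

```python
def read_hex_chars(data: bytes) -> str:
    # Read the file content as a sequence of hexadecimal characters
    # (two characters per byte), per the hexadecimal reading rule.
    return data.hex()

def ngrams(chars: str, n: int):
    # Combine adjacent characters into length-n strings
    # (n=2 for the BiGram model, n=3 for the TriGram model).
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]

data = b"\x0a\xbc"
chars = read_hex_chars(data)   # "0abc"
bigrams = ngrams(chars, 2)     # ["0a", "ab", "bc"]
trigrams = ngrams(chars, 3)    # ["0ab", "abc"]
```

Other reading rules (binary, octal) would only change the alphabet of characters produced by the reading step.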
- a transfer matrix is constructed according to a plurality of strings corresponding to the file to be identified. Among them, the elements in the transfer matrix correspond one-to-one with the type of the string.
- A string type is a distinct string value; the string types that can be obtained differ depending on the reading rule and/or the phrase model used.
- constructing the transfer matrix according to the plurality of character strings corresponding to the file to be identified may include: determining the number of occurrences of each character string in the plurality of character strings, according to the number of occurrences of each character string Construct a transfer matrix.
- the number of rows and the number of columns of the transfer matrix are the same, and the number of rows and columns of the transfer matrix are: the ratio of the number of string types to the number of character types.
- The number of string types is the number of types of strings that can be obtained when character strings are determined according to the preset reading rule and the preset phrase model; the number of character types is the number of types of characters that can be obtained when the file is read according to the preset reading rule.
- the preset reading rule is hexadecimal
- the preset phrase model includes the BiGram model and the TriGram model.
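Under these settings, the matrix dimension given later in this description (272) can be verified with a short calculation:

```python
num_char_types = 16                # hexadecimal reading rule
num_string_types = 16**2 + 16**3   # BiGram string types + TriGram string types
# Rows = columns = ratio of the number of string types
# to the number of character types.
dim = num_string_types // num_char_types
assert dim == 272                  # a 272*272 transfer matrix
```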
- Constructing the transfer matrix according to the number of occurrences of each character string may include: for each character string, taking the number of occurrences of the character string as the value of the element corresponding to that character string in the transfer matrix, thereby obtaining the transfer matrix.
- In the following example, the BiGram model is taken as the preset phrase model.
- the network device obtains the file f1 to be identified, reads the file f1 to be identified according to a preset reading rule, and obtains a plurality of characters: abcbbcdabcd.
- The adjacent characters of the plurality of characters corresponding to the file f1 to be identified are combined, and the obtained plurality of character strings are: ab, bc, cb, bb, bc, cd, da, ab, bc, cd.
- The number of occurrences of each character string is: "ab" appears 2 times, "bc" 3 times, "cb" 1 time, "bb" 1 time, "cd" 2 times, and "da" 1 time; every other string appears 0 times.
- Each square in FIG. 2 represents an element of the matrix; the horizontal character and the vertical character corresponding to the square form a string, which is the string corresponding to that square.
- Alternatively, constructing the transfer matrix according to the number of occurrences of each character string may include: for each character string, calculating the sum of the number of occurrences of the character string and a preset initial value, and using the calculated sum as the value of the element corresponding to that character string in the transfer matrix, thereby obtaining the transfer matrix.
- Each square in FIG. 3 represents an element of the matrix; the horizontal character and the vertical character corresponding to the square form a string, which is the string corresponding to that square.
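The matrix construction above can be sketched as follows, using the BiGram counts of the "abcbbcdabcd" example over the toy alphabet a–d (the real matrix covers all string types; the function name and row/column convention — first character indexes the row — are illustrative assumptions):

```python
from collections import Counter

def build_transfer_matrix(chars: str, alphabet: str, initial: int = 0):
    # Count the occurrences of each adjacent character pair (BiGram),
    # then store count + preset initial value at the element indexed
    # by (first character, second character) of the pair.
    counts = Counter(chars[i:i + 2] for i in range(len(chars) - 1))
    index = {c: k for k, c in enumerate(alphabet)}
    n = len(alphabet)
    matrix = [[initial] * n for _ in range(n)]
    for s, cnt in counts.items():
        matrix[index[s[0]]][index[s[1]]] = cnt + initial
    return matrix

m = build_transfer_matrix("abcbbcdabcd", "abcd")
# "ab" appears 2 times, "bc" 3 times, "cd" 2 times, "da" 1 time
assert m[0][1] == 2 and m[1][2] == 3 and m[2][3] == 2 and m[3][0] == 1
```

With a preset initial value (second construction variant), strings that never occur get the initial value instead of 0.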
- Section 104: determining target image data corresponding to the file to be identified according to the elements in the transfer matrix.
- one element in the transfer matrix corresponds to one image cell
- the target image data corresponding to the file to be identified is determined, that is, the value of each element in the transfer matrix is converted into image data.
- the color depth of the image cell corresponding to each element in the transfer matrix is calculated according to the value of each element in the transfer matrix, and the target image data corresponding to the file to be identified is obtained.
- the above image cell is the smallest unit of image processing.
- the color depth is the gray value of the point in the black and white image. In the embodiment of the present application, the color depth is taken as the value of the image cell.
- the color depth ranges from 0 to 255, white is 255, and black is 0.
- the range of the color depth is not limited in the embodiment of the present application, that is, the color depth may be an integer, may be a decimal number, may be a positive number, or may be a negative number.
- the color depth of the image cell corresponding to each element may be determined in the following manner.
- For the first element in the transfer matrix, the value of the first element is determined as a first value, wherein the first element is any element in the transfer matrix, and the value of the first element is determined according to the number of occurrences of the first character string.
- the first string is a string corresponding to the first element in the transfer matrix.
- the sum of the values of all the second elements is determined to be the second value.
- the value of the second element is determined according to the number of occurrences of the second character string, and the head word of the second character string is the same as the head word of the first character string.
- the first character string is included in the second character string.
- The head word of a string is its first character.
- The color depth of the image cell corresponding to the first element in the transfer matrix is then determined.
- The calculated ratio, that is, the ratio of the first value to the second value, may be used as the color depth of the image cell corresponding to the first element in the transfer matrix.
- This determination may be performed for each element in the transfer matrix.
- Alternatively, the transition probability h of the first element may be determined from the calculated ratio T, that is, the ratio of the first value to the second value, according to a preset formula.
- The calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
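A minimal sketch of this normalization, assuming each matrix row holds the strings sharing one head (first) character (an assumption for illustration), using the BiGram counts of the "abcbbcdabcd" example over the toy alphabet a–d:

```python
def to_color_depths(matrix):
    # For each element (the first value), divide by the sum of all
    # elements whose strings share the same head character (the second
    # value); the resulting ratio serves as the color depth of the
    # corresponding image cell.
    depths = []
    for row in matrix:
        total = sum(row)  # second value for this head character
        depths.append([v / total if total else 0.0 for v in row])
    return depths

# Transfer matrix for "abcbbcdabcd" over the alphabet "abcd":
m = [[0, 2, 0, 0],
     [0, 1, 3, 0],
     [0, 1, 0, 2],
     [1, 0, 0, 0]]
d = to_color_depths(m)
assert d[1][2] == 0.75   # "bc": 3 / (1 + 3)
```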
- Section 105: extracting features of the target image data, and determining whether the file to be identified is a malicious file according to the features of the target image data.
- a feature of the target image data may be extracted using a Convolutional Neural Network (CNN) model.
- CNN Convolutional Neural Network
- The CNN model adopted in the embodiment of the present application may be an improvement based on the classic CNN Lenet-5 structure.
- Lenet-5 is a classic CNN network architecture, including 3 convolutional layers, 2 pooling layers and 2 fully connected layers.
- The improved Lenet-5 structure is shown in FIG. 5.
- the first convolutional layer includes 32 convolution kernels
- the second convolutional layer includes 64 convolution kernels.
- A DropOut layer (English: DropOut) with a drop rate of 0.25 is added after the second pooling layer, and a DropOut layer with a drop rate of 0.5 is added after the first fully connected layer.
- The DropOut layer may also be called a discard layer.
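For orientation, the parameter counts implied by the two convolutional layers can be checked with a short calculation. The 3×3 kernel size and single-channel (grayscale) input are assumptions of this sketch; the description only fixes the kernel counts (32 and 64), and the DropOut layers themselves add no parameters:

```python
def conv_params(kernels: int, kernel_size: int, in_channels: int) -> int:
    # Weights plus one bias per kernel; kernel size is an assumed 3x3.
    return kernels * (kernel_size * kernel_size * in_channels) + kernels

p1 = conv_params(32, 3, 1)    # first convolutional layer, grayscale input
p2 = conv_params(64, 3, 32)   # second convolutional layer
assert p1 == 320 and p2 == 18496
```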
- The features of the target image data may be identified by using the DNN model; that is, the DNN model identifies, from the features of the target image data, whether the file to be identified is a malicious file.
- the feature of the target image data is input into the pre-trained DNN model to obtain an output result, wherein the output result indicates whether the file to be identified is a malicious file.
- the output result indicates that the file to be identified is a malicious file, or the output result indicates that the file to be identified is a non-malicious file.
- a non-malicious file is a secure file.
- Inputting the features of the target image data into the DNN model yields a first probability that the file to be identified is a secure file and a second probability that the file to be identified is a malicious file. If the first probability is greater than the second probability, the output of the DNN model indicates that the file to be identified is a secure file; otherwise, the output indicates that the file to be identified is a malicious file.
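The decision rule described above reduces to a comparison of the two output probabilities; a minimal sketch (labels are illustrative):

```python
def classify(first_prob: float, second_prob: float) -> str:
    # first_prob:  probability that the file is a secure file
    # second_prob: probability that the file is a malicious file
    # If the first probability is greater, the file is reported
    # as secure; otherwise it is reported as malicious.
    return "secure" if first_prob > second_prob else "malicious"

assert classify(0.8, 0.2) == "secure"
assert classify(0.3, 0.7) == "malicious"
```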
- the feature of the image data is used to determine whether the file to be identified is a malicious file.
- The features of the image data exist objectively in the file to be identified, rather than being set according to experience. Obtaining the recognition result from these objectively existing features reduces the dependence of file recognition on subjective human factors and improves the accuracy of file recognition.
- the DNN model and the CNN model may be pre-trained before the identification of the file to be recognized.
- The model training method shown in FIG. 6 includes the following process.
- The initialized parameter set of the preset DNN model may be represented by θ_i; a corresponding initialized parameter set is maintained for the preset CNN model.
- The initialization parameters can be set according to actual needs and experience; i is the cumulative count of forward calculations performed so far.
- Training-related hyperparameters, such as the learning rate, the gradient descent algorithm, and the back propagation algorithm, are also set.
- These hyperparameters may be set in various manners known in the related art, and are not described in detail herein.
- the preset training set includes a sample file and a label of the sample file, and the label may include: a first label for indicating that the file is a malicious file and a second label for indicating that the file is a non-malicious file.
- the sample file can be a binary file.
- the sample file included in the preset training set may be obtained from the network through a web crawler or the like, or may be obtained from a pre-acquired sample file library, which is not limited by the embodiment of the present application.
- The order of execution of sections 601, 602, and 603 is not limited in the embodiment of the present application.
- Section 604 Convert each sample file in the preset training set to image data.
- Section 605 Perform a forward calculation as follows.
- the image data of each sample file obtained in Section 604 is input to a preset CNN model to obtain features of the image data corresponding to the sample file.
- the feature outputted by the preset CNN model is input into a preset DNN model to obtain an output result corresponding to the sample file.
- the output indicates that the sample file is a secure file or indicates that the sample file is a malicious file.
- For each sample file, a third probability that the sample file is a secure file and a fourth probability that the sample file is a malicious file are obtained. If the third probability is greater than the fourth probability, the output result corresponding to the sample file indicates that the sample file is a secure file; otherwise, the output result indicates that the sample file is a malicious file.
- For the first forward calculation, the current parameter set is θ_1.
- For subsequent forward calculations, the current parameter set θ_i is obtained by adjusting the previously used parameter set θ_(i-1); the previously used CNN parameter set is adjusted in the same way, as described below.
- the loss value is calculated based on the label of each sample file and the output corresponding to the preset DNN model.
- The mean squared error (English: Mean Squared Error, MSE for short) formula can be used as the loss function to obtain the loss value L(θ_i), as shown in the following formula:
- L(θ_i) = (1/H) · Σ_{j=1}^{H} ( f(I_j; θ_i) − X_j )²
- where H represents the number of sample files selected from the preset training set in a single training round, I_j represents the features of the image data corresponding to the j-th sample file, f(I_j; θ_i) represents the output result of the forward calculation of the DNN model under the parameter set θ_i for the j-th sample file, X_j represents the label of the j-th sample file, and i is the cumulative count of forward calculations.
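The MSE loss described here can be sketched as follows, writing f for the DNN forward calculation and treating its outputs as precomputed numbers (the numeric label encoding is an illustrative assumption):

```python
def mse_loss(outputs, labels):
    # outputs: f(I_j; theta_i), the forward-calculation results of the
    #          DNN model for the H sample files selected in this round
    # labels:  X_j, the label of each sample file
    #          (e.g. 1.0 = malicious, 0.0 = secure, an assumed encoding)
    H = len(outputs)
    return sum((o - x) ** 2 for o, x in zip(outputs, labels)) / H

assert mse_loss([1.0, 0.0], [1.0, 0.0]) == 0.0
assert mse_loss([0.5, 0.5], [1.0, 0.0]) == 0.25
```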
- The preset models include the preset CNN model and the preset DNN model.
- Convergence may be determined when the loss value is less than a preset loss threshold.
- Alternatively, convergence may be determined when the difference between the current loss value and the previously calculated loss value is less than a preset change threshold. This is not limited here.
- Section 608: adjust the parameters in the current parameter set θ_i and in the corresponding CNN parameter set to obtain the adjusted parameter sets, and then return to section 605 for the next forward calculation.
- the back propagation algorithm can be used to adjust the parameters in the current parameter set.
- Upon convergence, the current parameter set θ_i is taken as the final output parameter set θ_final, and the final CNN parameter set is output in the same way.
- The preset DNN model under the final parameter set θ_final is used as the trained DNN model, and the preset CNN model under its final parameter set is used as the trained CNN model.
- the training of the above CNN model and DNN model can be implemented on the same network device as the file identification.
- Alternatively, the network device that trains the CNN model and the DNN model may be different from the network device that performs file identification.
- the feature of the target image data may be identified by using a malicious file feature library to determine whether the file to be identified is a malicious file.
- The malicious file feature library includes features of the image data corresponding to a plurality of sample malicious files. Specifically, the target image data is input into the CNN model, and the output result of a preset layer of the CNN model is obtained as the feature of the target image data. This feature is then looked up in the preset malicious file feature library: if found, the file to be identified is determined to be a malicious file; if not found, the file to be identified is determined to be a secure file.
- To build the library, the image data corresponding to each sample malicious file can be input into the CNN model, and the output result of the preset layer of the CNN model is taken as the feature of the image data corresponding to that sample malicious file.
- A malicious file feature library is constructed from the features of the image data corresponding to the plurality of sample malicious files.
- The preset layer may be the third convolutional layer of the CNN model, as shown in FIG. 5.
- the feature length of the third convolutional layer output is 512 bytes.
- Because the features in the malicious file feature library are extracted directly from malicious files, if the feature of the file to be identified matches a feature in the malicious file feature library, the file to be identified can be determined to be a malicious file, which improves the accuracy of file recognition. In addition, matching a feature against the malicious file feature library requires far less calculation than identifying the feature with the DNN model, which improves the efficiency of file recognition.
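A minimal sketch of the library lookup, assuming exact matching of the 512-byte features (the matching rule and data structure are illustrative assumptions, not specified by the patent):

```python
def build_feature_library(sample_features):
    # Store the preset-layer output (a 512-byte feature) of each
    # sample malicious file for exact-match lookup.
    return {bytes(f) for f in sample_features}

def identify(feature, library) -> str:
    # If the feature of the file to be identified is found in the
    # malicious file feature library, the file is malicious;
    # otherwise it is treated as a secure file.
    return "malicious" if bytes(feature) in library else "secure"

lib = build_feature_library([b"\x01" * 512])
assert identify(b"\x01" * 512, lib) == "malicious"
assert identify(b"\x02" * 512, lib) == "secure"
```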
- FIG. 7 is a second schematic flowchart of a file identification method according to an embodiment of the present application, including the following process.
- the execution subject of the file identification method is a network device as an example.
- the file to be identified obtained by the network device may be: a file sent by another network device to the network device.
- the file to be identified obtained by the network device may also be: a file obtained from a locally stored file.
- the file to be identified is input into a pre-trained file recognition model to determine whether the file to be identified is a malicious file.
- The file recognition model is configured to: determine a plurality of character strings corresponding to the input file according to the preset reading rule and the preset phrase model; construct a transfer matrix according to the plurality of character strings, where the elements in the transfer matrix correspond one-to-one with the string types; determine the target image data corresponding to the input file according to the elements in the transfer matrix; and extract features of the target image data and determine whether the input file is a malicious file according to the features of the target image data.
- The input file is the file input into the file recognition model; here, the input file is the file to be identified.
- A string type is a distinct string value; the string types that can be obtained differ depending on the reading rule and/or the phrase model used.
- Determining a plurality of character strings corresponding to the input file according to the preset reading rule and the preset phrase model may include: reading the input file according to the preset reading rule to obtain a plurality of characters, and combining adjacent characters of the plurality of characters according to the preset phrase model to obtain a plurality of character strings.
- the reading rule may include: binary, octal, or hexadecimal, but is not limited to these types of reading rules.
- the preset phrase model may include a BiGram model and/or a TriGram model.
- constructing the transfer matrix according to the plurality of character strings corresponding to the input file may include: determining the number of occurrences of each character string in the plurality of character strings, according to the number of occurrences of each character string, Construct a transfer matrix.
- the number of rows and the number of columns of the transfer matrix are the same, and the number of rows and columns of the transfer matrix are: the ratio of the number of string types to the number of character types.
- the number of the type of the string is: the number of types of the string that can be obtained when the character string is determined according to the preset reading rule and the preset phrase model; the number of character types is: when the file is read according to the preset reading rule, The number of types of characters that can be obtained.
- the preset reading rule is hexadecimal
- the preset phrase model may include a BiGram model and a TriGram model.
- Since the number of rows and the number of columns of the transfer matrix are the same and the elements in the transfer matrix correspond one-to-one with the string types, the number of rows and the number of columns of the transfer matrix may be 272. That is, a 272*272 transfer matrix can be constructed according to the number of occurrences of each character string corresponding to the input file.
- Constructing the transfer matrix according to the number of occurrences of each character string may include: for each character string, taking the number of occurrences of the character string as the value of the element corresponding to that character string in the transfer matrix, thereby obtaining the transfer matrix.
- Alternatively, constructing the transfer matrix according to the number of occurrences of each character string may include: for each character string, calculating the sum of the number of occurrences of the character string and a preset initial value, and using the calculated sum as the value of the element corresponding to that character string in the transfer matrix, thereby obtaining the transfer matrix.
- one element in the transfer matrix corresponds to one image cell
- the target image data corresponding to the input file is determined, that is, the value of each element in the transfer matrix is converted into image data.
- determining the target image data corresponding to the input file according to the elements in the transfer matrix may include: calculating a color depth of the image cell corresponding to each element in the transfer matrix according to the value of each element in the transfer matrix, and obtaining an input file corresponding to the input file Target image data.
- the above image cell is the smallest unit of image processing.
- the color depth is the gray value of the point in the black and white image. In the embodiment of the present application, the color depth is taken as the value of the image cell.
- the color depth of the image cell corresponding to each element may be determined in the following manner. Specifically, calculating the color depth of the image cell corresponding to each element in the transfer matrix according to the value of each element in the transfer matrix may include: determining, for the first element in the transfer matrix, a value of the first element as the first value.
- the first element is any element in the transfer matrix, and the value of the first element is determined according to the number of occurrences of the first character string.
- the first string is a string corresponding to the first element in the transfer matrix.
- the sum of the values of all the second elements is determined to be the second value.
- the value of the second element is determined according to the number of occurrences of the second character string, and the head word of the second character string is the same as the head word of the first character string.
- The head word of a string is its first character.
- the color depth of the image cell corresponding to the first element in the transfer matrix is determined.
- the calculated ratio may be used as the color depth of the image cell corresponding to the first element in the transfer matrix.
- Alternatively, the transition probability h of the first element may be determined from the calculated ratio T, that is, the ratio of the first value to the second value, according to a preset formula.
- the calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
- extracting features of the target image data may include: inputting the target image data into the pre-trained CNN model to obtain features of the target image data.
- The adopted CNN model may be an improvement based on the classic CNN Lenet-5 structure.
- Lenet-5 is a classic CNN network architecture, including 3 convolutional layers, 2 pooling layers and 2 fully connected layers.
- The improved Lenet-5 structure is shown in FIG. 5.
- the first convolutional layer includes 32 convolution kernels
- the second convolutional layer includes 64 convolution kernels.
- The features of the target image data may be identified by using the DNN model; that is, the DNN model identifies, from the features of the target image data, whether the input file is a malicious file.
- determining whether the input file is a malicious file according to the feature of the target image data may include: inputting the feature of the target image data into the pre-trained DNN model to obtain an output result; wherein the DNN model is used to perform the feature of the image data. Identifying whether the file corresponding to the image data is a malicious file, and the output result indicates whether the input file is a malicious file.
- inputting features of the target image data into the DNN model yields a first probability that the input file is a secure file and a second probability that the input file is a malicious file. If the first probability is greater than the second probability, the output of the DNN model indicates that the input file is a secure file. Otherwise, the output of the DNN model indicates that the input file is a malicious file.
- the DNN model and the CNN model may be pre-trained before the identification of the file to be recognized.
- the training process of the DNN model and the CNN model can be understood with reference to the description of steps 601-609 of the embodiment shown in FIG. 6.
- the feature of the target image data may be identified by using a malicious file feature library to determine whether the file to be identified is a malicious file.
- the malicious file feature library includes features of the image data corresponding to a plurality of sample malicious files. Specifically, the target image data is input into the CNN model, and the output result of the preset layer of the CNN model is acquired as the feature of the target image data. This feature is then searched for in the preset malicious file feature library: if found, the input file is determined to be a malicious file; if not found, the input file is determined to be a secure file.
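A hedged sketch of this library lookup follows. The text says only that the feature is "found" in the library; exact byte-for-byte matching of the 512-byte preset-layer output is our assumption, and all names are illustrative.

```python
# Hypothetical sketch of the malicious-file-feature-library lookup:
# the feature vector taken from the CNN's preset layer is matched
# against a set of features extracted from sample malicious files.
# Exact-match lookup is an assumption for illustration.

def is_malicious(feature: bytes, library: set) -> bool:
    """A file is flagged malicious iff its feature appears in the library."""
    return feature in library

library = {b"\x01\x02" * 256}                    # toy 512-byte sample feature
print(is_malicious(b"\x01\x02" * 256, library))  # found -> malicious
print(is_malicious(b"\x00" * 512, library))      # not found -> secure
```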
- the image data corresponding to each sample malicious file can be input into the CNN model, and the output result of the preset layer of the CNN model is taken as the feature of the image data corresponding to that sample malicious file.
- a malicious file feature library is constructed from the features of the image data corresponding to the plurality of sample malicious files.
- the preset layer may be the third convolutional layer of the CNN model, as shown in Figure 4.
- the feature length of the third convolutional layer output is 512 bytes.
- since the features in the malicious file feature library are directly extracted from malicious files, if the feature of the file to be identified matches a feature in the library, the file to be identified can be determined to be a malicious file, which improves the accuracy of file identification. In addition, matching a feature against the malicious file feature library requires far less computation than identifying the feature with the DNN model, which improves the efficiency of file identification.
- when file identification is performed, the file to be identified is converted into image data, the features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
- the features of the image data objectively exist in the file to be identified and are not set according to experience; obtaining the file identification result from these objectively existing features reduces the dependence of file identification on subjective human factors and improves the accuracy of file identification.
- FIG. 8 is a schematic flowchart of a feature extraction method according to an embodiment of the present application. The method includes the following process.
- Step 801: Multiple sample files are respectively input into the file recognition model.
- the file identification model is configured to: determine a plurality of character strings corresponding to the input file according to the preset reading rule and the preset phrase model; construct a transfer matrix according to the plurality of character strings corresponding to the input file; determine target image data corresponding to the input file according to the elements in the transfer matrix, wherein the elements in the transfer matrix correspond one-to-one with the types of the character strings; extract features of the target image data using the CNN model; and identify the features of the target image data using the DNN model to determine whether the input file is a malicious file.
- the above character string type is the type of a character string; the types of character strings obtained differ depending on the reading rules and/or phrase models used.
- the input file is a file that is input into the file identification model.
- the multiple sample files are input files.
- the DNN model and the CNN model are trained before the feature is extracted.
- the training process of the DNN model and the CNN model can be understood with reference to the description of steps 601-609 of the embodiment shown in FIG. 6.
- Step 802: For each sample file, the output of the preset layer of the CNN model is extracted as the feature of the sample file.
- determining a plurality of character strings corresponding to the input file according to the preset reading rule and the preset phrase model may include: reading the input file according to the preset reading rule to obtain multiple characters. According to the preset phrase model, a plurality of characters are obtained by combining adjacent characters of a plurality of characters.
- the reading rule may include: binary, octal, or hexadecimal, but is not limited to these types of reading rules.
- the preset phrase model can include a BiGram model and/or a TriGram model.
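The reading and combining steps above can be sketched in a few lines. This is an illustrative reading of the text, assuming a hexadecimal reading rule in which each byte of the file becomes two hex characters; the function names are our own.

```python
# Hedged sketch of the reading step: read a file's bytes as hexadecimal
# characters, then combine adjacent characters into strings with the
# BiGram (n=2) and TriGram (n=3) phrase models.

def read_hex_chars(data: bytes) -> str:
    """Hexadecimal reading rule: each byte yields two hex characters."""
    return data.hex()

def ngrams(chars: str, n: int) -> list:
    """Combine n adjacent characters into one string (BiGram: n=2, TriGram: n=3)."""
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]

chars = read_hex_chars(b"\x4d\x5a")   # "4d5a"
print(ngrams(chars, 2))               # BiGram strings
print(ngrams(chars, 3))               # TriGram strings
```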
- constructing the transfer matrix according to the plurality of character strings corresponding to the input file may include: determining the number of occurrences of each character string in the plurality of character strings; according to the number of occurrences of each character string, Construct a transfer matrix.
- the number of rows and the number of columns of the transfer matrix are the same, and the number of rows and columns of the transfer matrix are: the ratio of the number of string types to the number of character types.
- the number of the type of the string is: the number of types of the string that can be obtained when the character string is determined according to the preset reading rule and the preset phrase model; the number of character types is: when the file is read according to the preset reading rule, The number of types of characters that can be obtained.
- the preset reading rule is hexadecimal
- the preset phrase model may include a BiGram model and a TriGram model.
- since the number of rows and the number of columns of the transfer matrix are the same and the elements in the transfer matrix correspond one-to-one with the types of the character strings, the number of rows and the number of columns of the transfer matrix may be 272. That is, a 272*272 transfer matrix can be constructed according to the number of occurrences of each character string corresponding to the input file.
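The 272 figure follows from the definitions above; the worked arithmetic below is our reading of them, assuming hexadecimal reading (16 character types) combined with both the BiGram and TriGram models.

```python
# Worked arithmetic for the matrix dimension stated above: hexadecimal
# reading gives 16 character types; the BiGram and TriGram models give
# 16**2 + 16**3 string types; the matrix dimension is their ratio.
char_types = 16
string_types = char_types ** 2 + char_types ** 3   # 256 + 4096 = 4352
rows = string_types // char_types
print(rows)  # matrix is rows x rows
```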
- constructing the transfer matrix according to the number of occurrences of each character string may include: for each character string, taking the number of occurrences of the character string as the value of the element corresponding to the character string in the transfer matrix, to obtain the transfer matrix.
- alternatively, constructing the transfer matrix according to the number of occurrences of each character string may include: for each character string, calculating the sum of the number of occurrences of the character string and a preset initial value, and taking the calculated sum as the value of the element corresponding to the character string in the transfer matrix, to obtain the transfer matrix.
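The counting step can be sketched for the hexadecimal BiGram case alone, where 16 character types give a 16*16 matrix with one element per two-character string holding that string's occurrence count (plus an optional preset initial value). The combined BiGram+TriGram scheme yielding the 272*272 matrix follows the same counting idea; its exact index layout is not spelled out in the text, so this small case is an assumption-laden illustration.

```python
from collections import Counter

# Hedged sketch of transfer-matrix construction, hexadecimal BiGram only.
HEX = "0123456789abcdef"

def bigram_transfer_matrix(chars: str, initial: int = 0):
    """Count two-character strings; matrix[row][col] indexes (first, second) char."""
    counts = Counter(chars[i:i + 2] for i in range(len(chars) - 1))
    return [[initial + counts[a + b] for b in HEX] for a in HEX]

m = bigram_transfer_matrix("4d5a4d")
print(m[HEX.index("4")][HEX.index("d")])  # occurrences of the string '4d'
```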
- one element in the transfer matrix corresponds to one image cell
- the target image data corresponding to the input file is determined, that is, the value of each element in the transfer matrix is converted into image data.
- determining the target image data corresponding to the input file according to the elements in the transfer matrix may include: calculating a color depth of the image cell corresponding to each element in the transfer matrix according to the value of each element in the transfer matrix, and obtaining an input file corresponding to the input file Target image data.
- the above image cell is the smallest unit of image processing.
- the color depth is the gray value of the point in the black and white image. In the embodiment of the present application, the color depth is taken as the value of the image cell.
- the color depth of the image cell corresponding to each element may be determined in the following manner. Specifically, the color depth of the image cell corresponding to each element in the transfer matrix is calculated according to the value of each element in the transfer matrix, and may include:
- the value of the first element is determined to be the first value.
- the first element is any element in the transfer matrix, and the value of the first element is determined according to the number of occurrences of the first character string.
- the first string is a string corresponding to the first element in the transfer matrix.
- the sum of the values of all the second elements is determined to be the second value.
- the value of the second element is determined according to the number of occurrences of the second character string, and the head word of the second character string is the same as the head word of the first character string.
- the second character strings include the first character string itself.
- the head word of a character string is its first character.
- the color depth of the image cell corresponding to the first element in the transfer matrix is determined.
- the calculated ratio may be used as the color depth of the image cell corresponding to the first element in the transfer matrix.
- the transition probability of the first element may be determined according to the following formula:
- where h is the transition probability of the first element, and T is the calculated ratio, that is, the ratio of the first value to the second value.
- the calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
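The ratio-to-color-depth step above can be sketched row by row. For an element, T is the count of its string divided by the summed counts of all strings sharing the same head word; the text does not reproduce the formula linking the transition probability h to T, so scaling T to a 0-255 gray value is purely our assumption for illustration.

```python
# Hedged sketch of the color-depth computation: one matrix row holds the
# counts of all strings sharing one head word. Each cell's gray value is
# an assumed 0-255 scaling of its transition ratio T = count / row total.

def color_depth_row(row_counts):
    """Map one transfer-matrix row (shared head word) to gray values in 0..255."""
    total = sum(row_counts)           # the 'second value' for this head word
    if total == 0:
        return [0] * len(row_counts)  # no occurrences: leave the row blank
    return [round(255 * c / total) for c in row_counts]

print(color_depth_row([2, 1, 1, 0]))
```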
- the adopted CNN model can be obtained by improving the structure of the classical Lenet-5 CNN model.
- Lenet-5 is a classic CNN network architecture, including 3 convolutional layers, 2 pooling layers and 2 fully connected layers.
- the improvement of the Lenet-5 structure is shown in FIG.
- the first convolutional layer includes 32 convolution kernels
- the second convolutional layer includes 64 convolution kernels.
- the sample file is a sample malicious file.
- the method may further include: constructing a malicious file feature library according to the extracted multiple features.
- the preset layer may be the third convolutional layer of the CNN model.
- the feature length of the third convolutional layer output is 512 bytes.
- the malicious file feature library may be used to identify the file to be identified, to determine whether the file to be identified is a malicious file.
- the file to be identified is input into the file recognition model; the output result of the preset layer of the CNN model in the file recognition model is obtained as a target feature; and the target feature is searched for in the malicious file feature library. If found, the file to be identified is determined to be a malicious file; if not found, the file to be identified is determined to be a secure file.
- the feature output by the preset layer of the CNN model in the pre-trained recognition model is extracted as the feature of the file, so features do not need to be extracted by manual analysis of the file, which improves the efficiency of feature extraction and reduces labor cost.
- a malicious file feature library is constructed from the extracted features of malicious files, and the file to be identified is identified based on this library. Since the features in the malicious file feature library are directly extracted from malicious files, if the feature of the file to be identified matches a feature in the library, the file to be identified can be determined to be a malicious file, which improves the accuracy of file identification. In addition, matching a feature against the malicious file feature library requires far less computation than identifying the feature with the DNN model, which improves the efficiency of file identification.
- FIG. 9 is a schematic diagram of a first structure of a file identification apparatus according to an embodiment of the present disclosure, where the apparatus includes:
- the obtaining module 901 is configured to obtain a file to be identified
- the first determining module 902 is configured to determine, according to the preset reading rule and the preset phrase model, a plurality of character strings corresponding to the file to be identified;
- the construction module 903 is configured to construct a transfer matrix according to the plurality of character strings; wherein the elements in the transfer matrix have a one-to-one correspondence with the type of the string;
- a second determining module 904 configured to determine, according to an element in the transfer matrix, target image data corresponding to the file to be identified;
- the identification module 905 is configured to extract features of the target image data, and determine, according to characteristics of the target image data, whether the file to be identified is a malicious file.
- the above character string type is the type of a character string; the types of character strings obtained differ depending on the reading rules and/or phrase models used.
- the first determining module 902 may be specifically configured to:
- the file to be identified is read according to the preset reading rule to obtain a plurality of characters; and according to the preset phrase model, adjacent characters of the plurality of characters are combined to obtain a plurality of character strings.
- the building module 903 is specifically configured to:
- a transfer matrix is constructed based on the number of occurrences of each string.
- the building module 903 is specifically configured to:
- the number of occurrences of the string is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix
- the sum of the number of occurrences of the string and the preset initial value is calculated, and the calculated sum value is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix.
- the second determining module 904 is specifically configured to:
- the color depth of the image cell corresponding to each element in the transfer matrix is calculated according to the value of each element in the transfer matrix, and the target image data corresponding to the file to be identified is obtained.
- the second determining module 904 is specifically configured to:
- the color depth of the image cell corresponding to the first element is determined according to the calculated ratio.
- the first character string is a character string corresponding to the first element in the transfer matrix.
- the second determining module 904 is specifically configured to:
- the transition probability of the first element is determined according to the following formula:
- the calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
- the identification module 905 may be specifically configured to: input target image data into a pre-trained CNN model to obtain features of the target image data;
- the CNN model is based on the classic CNN Lenet-5 model.
- the first convolutional layer consists of 32 convolution kernels
- the second convolutional layer consists of 64 convolution kernels
- a 0.25 DropOut layer is added after the second pooling layer
- a 0.5 DropOut layer is added after the first fully connected layer.
- the identification module 905 is specifically configured to:
- the feature of the target image data is input into the pre-trained DNN model to obtain an output result; wherein the DNN model is used to identify, from the feature of the image data, whether the file corresponding to the image data is a malicious file, and the output result indicates whether the file to be identified is a malicious file.
- the feature of the target image data is an output result of a preset layer of the CNN model
- the identification module 905 can be specifically used to:
- the preset malicious file feature library includes: a feature of the image data corresponding to the plurality of sample malicious files;
- the file to be identified is a secure file.
- the features of the image data corresponding to the plurality of sample malicious files may be obtained as follows: for each sample malicious file, the image data corresponding to the sample malicious file is input into the CNN model, and the output result of the preset layer of the CNN model is taken as the feature of the corresponding image data.
- when file identification is performed, the file to be identified is converted into image data, the features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
- the features of the image data objectively exist in the file to be identified and are not set according to experience; obtaining the file identification result from these objectively existing features reduces the dependence of file identification on subjective human factors and improves the accuracy of file identification.
- FIG. 10 is a schematic diagram of a second structure of a file identification apparatus according to an embodiment of the present disclosure.
- the apparatus includes: an obtaining module 1001, an input module 1002, and a file identification model, where the file identification model includes: a first determining module 1003, a building module 1004, a second determining module 1005 and an identifying module 1006;
- the obtaining module 1001 is configured to obtain a file to be identified
- the input module 1002 is configured to input the file to be identified into the pre-trained file recognition model
- the first determining module 1003 is configured to determine, according to the preset reading rule and the preset phrase model, a plurality of character strings corresponding to the input file;
- the construction module 1004 is configured to construct a transfer matrix according to the plurality of character strings corresponding to the input file; the elements in the transfer matrix are in one-to-one correspondence with the type of the string;
- a second determining module 1005 configured to determine, according to an element in the transfer matrix, target image data corresponding to the input file
- the identification module 1006 is configured to extract features of the target image data, and determine, according to characteristics of the target image data, whether the input file is a malicious file.
- the above character string type is the type of a character string; the types of character strings obtained differ depending on the reading rules and/or phrase models used.
- the first determining module 1003 may be specifically configured to:
- the input file is read according to a preset reading rule to obtain a plurality of characters; according to the preset phrase model, adjacent characters of the plurality of characters are combined to obtain a plurality of character strings.
- the building module 1004 is specifically configured to:
- the building module 1004 is specifically configured to:
- the number of occurrences of the string is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix
- the sum of the number of occurrences of the string and the preset initial value is calculated, and the calculated sum value is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix.
- the second determining module 1005 may be specifically configured to:
- the color depth of the image cell corresponding to each element in the transfer matrix is calculated according to the value of each element in the transfer matrix, and the target image data corresponding to the input file is obtained.
- the second determining module 1005 may be specifically configured to:
- the color depth of the image cell corresponding to the first element is determined according to the calculated ratio.
- the first character string is a character string corresponding to the first element in the transfer matrix.
- the second determining module 1005 may be specifically configured to:
- the transition probability of the first element is determined according to the following formula:
- the calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
- the identification module 1006 may be specifically configured to: input target image data into a pre-trained CNN model to obtain features of the target image data;
- the CNN model is based on the classic CNN Lenet-5 model.
- the first convolutional layer consists of 32 convolution kernels
- the second convolutional layer consists of 64 convolution kernels
- a 0.25 DropOut layer is added after the second pooling layer
- a 0.5 DropOut layer is added after the first fully connected layer.
- the identification module 1006 may be specifically configured to:
- the feature of the target image data is input into the pre-trained DNN model to obtain an output result; wherein the DNN model is used to identify, from the feature of the image data, whether the file corresponding to the image data is a malicious file, and the output result indicates whether the input file is a malicious file.
- the feature of the target image data is an output result of a preset layer of the CNN model
- the identification module 1006 can be specifically used to:
- the preset malicious file feature library includes: a feature of the image data corresponding to the plurality of sample malicious files;
- the features of the image data corresponding to the plurality of sample malicious files may be obtained as follows: for each sample malicious file, the image data corresponding to the sample malicious file is input into the CNN model, and the output result of the preset layer of the CNN model is taken as the feature of the corresponding image data.
- when file identification is performed, the file to be identified is converted into image data, the features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
- the features of the image data objectively exist in the file to be identified and are not set according to experience; obtaining the file identification result from these objectively existing features reduces the dependence of file identification on subjective human factors and improves the accuracy of file identification.
- FIG. 11 is a schematic structural diagram of a feature extraction apparatus according to an embodiment of the present disclosure.
- the device includes: an input module 1101, an extraction module 1102, and a file recognition model.
- the file identification model includes: a first determining module 1103, a first building module 1104, a second determining module 1105, and a first identifying module 1106.
- the input module 1101 is configured to input multiple sample files into the file recognition model respectively;
- the first determining module 1103 is configured to determine, according to the preset reading rule and the preset phrase model, a plurality of character strings corresponding to the input file;
- the first constructing module 1104 is configured to construct a transfer matrix according to the plurality of character strings corresponding to the input file; the elements in the transfer matrix are in one-to-one correspondence with the type of the character string;
- a second determining module 1105 configured to determine, according to an element in the transfer matrix, target image data corresponding to the input file
- the first identification module 1106 is configured to extract features of the input target image data by using the CNN model, and identify the features of the target image data by using the DNN model to determine whether the input file is a malicious file;
- the extracting module 1102 is configured to extract, for each sample file, the output result of the preset layer of the CNN model as the feature of the sample file.
- the above character string type is the type of a character string; the types of character strings obtained differ depending on the reading rules and/or phrase models used.
- the above use of the DNN model to identify the features of the target image data and determine whether the input file is a malicious file means that the DNN model identifies the input file by using the features of the image data and determines whether it is a malicious file.
- the first determining module 1103 may be specifically configured to:
- the input file is read according to a preset reading rule to obtain a plurality of characters; according to the preset phrase model, adjacent characters of the plurality of characters are combined to obtain a plurality of character strings.
- the first building module 1104 may be specifically configured to:
- the first building module 1104 may be specifically configured to:
- the number of occurrences of the string is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix
- the sum of the number of occurrences of the string and the preset initial value is calculated, and the calculated sum value is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix.
- the second determining module 1105 may be specifically configured to:
- the color depth of the image cell corresponding to each element in the transfer matrix is calculated according to the value of each element in the transfer matrix, and the target image data corresponding to the input file is obtained.
- the second determining module 1105 may be specifically configured to:
- the color depth of the image cell corresponding to the first element is determined according to the calculated ratio.
- the first character string is a character string corresponding to the first element in the transfer matrix.
- the second determining module 1105 may be specifically configured to:
- the transition probability of the first element is determined according to the following formula:
- the calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
- the CNN model is based on the classical CNN Lenet-5 model
- the first convolutional layer includes 32 convolution kernels
- the second convolutional layer includes 64 convolution kernels
- a 0.25 DropOut layer is added after the second pooling layer
- a 0.5 DropOut layer is added after the first fully connected layer.
- the sample file is a sample malicious file
- the feature extraction device may further include: a second building module, configured to construct a malicious file feature library according to the plurality of extracted features.
- the feature extraction device may further include: a second identification module, configured to:
- the feature output by the preset layer of the CNN model in the pre-trained recognition model is extracted as the feature of the file, so features do not need to be extracted by manual analysis of the file, which improves the efficiency of feature extraction and reduces labor cost.
- a malicious file feature library is constructed from the extracted features of malicious files, and the file to be identified is identified based on this library. Since the features in the malicious file feature library are directly extracted from malicious files, if the feature of the file to be identified matches a feature in the library, the file to be identified can be determined to be a malicious file, which improves the accuracy of file identification. In addition, matching a feature against the malicious file feature library requires far less computation than identifying the feature with the DNN model, which improves the efficiency of file identification.
- the embodiment of the present application further provides a network device, as shown in FIG. 12, including a processor 1201 and a machine readable storage medium 1202 storing machine executable instructions executable by the processor 1201.
- the processor 1201 is caused by the machine executable instructions to implement the file identification method illustrated in FIG. 1 above. Specifically, the processor 1201 is caused by the machine executable instructions to implement:
- the above character string type is the type of a character string; the types of character strings obtained differ depending on the reading rules and/or phrase models used.
- the processor 1201 is caused by machine executable instructions to specifically implement:
- the file to be identified is read according to a preset reading rule to obtain a plurality of characters; according to the preset phrase model, adjacent characters of the plurality of characters are combined to obtain a plurality of character strings.
- the processor 1201 is caused by machine executable instructions to specifically implement:
- the processor 1201 is caused by machine executable instructions to specifically implement:
- the number of occurrences of the string is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix
- the sum of the number of occurrences of the string and the preset initial value is calculated, and the calculated sum value is used as the value of the element corresponding to the string in the transfer matrix to obtain a transfer matrix.
- the processor 1201 is caused by machine executable instructions to specifically implement:
- the color depth of the image cell corresponding to each element in the transfer matrix is calculated according to the value of each element in the transfer matrix, and the target image data corresponding to the file to be identified is obtained.
- the processor 1201 is caused by machine executable instructions to specifically implement:
- the color depth of the image cell corresponding to the first element is determined according to the calculated ratio.
- the first character string is a character string corresponding to the first element in the transfer matrix.
- the processor 1201 is caused by machine executable instructions to specifically implement:
- the transition probability of the first element is determined according to the following formula:
- the calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
- the processor 1201 is caused by machine executable instructions to specifically implement:
- the target image data is input into the pre-trained CNN model to obtain the features of the target image data; wherein the CNN model is based on the classical Lenet-5 CNN model: the first convolutional layer includes 32 convolution kernels, the second convolutional layer includes 64 convolution kernels, a 0.25 DropOut layer is added after the second pooling layer, and a 0.5 DropOut layer is added after the first fully connected layer.
- the processor 1201 is caused by machine executable instructions to specifically implement:
- the feature of the target image data is input into the pre-trained DNN model to obtain an output result; wherein the DNN model is used to identify, from the feature of the image data, whether the file corresponding to the image data is a malicious file, and the output result indicates whether the file to be identified is a malicious file.
- the feature of the target image data is an output result of a preset layer of the CNN model
- the processor 1201 is caused by the machine executable instructions to specifically implement:
- the feature of the target image data is searched for in a preset malicious file feature library; the preset malicious file feature library includes: features of the image data corresponding to a plurality of sample malicious files;
- if the feature is found, the file to be identified is determined to be a malicious file; if it is not found, the file to be identified is determined to be a secure file.
- the features of the image data corresponding to the plurality of sample malicious files may be obtained by inputting, for each sample malicious file, the image data corresponding to the sample malicious file into the CNN model, and taking the output result of the preset layer of the CNN model as the feature of the corresponding image data.
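A minimal sketch of this library-based identification. Exact matching of feature vectors is an assumption; the text only says the target feature is searched for in the library:

```python
def build_feature_library(sample_features):
    """Collect CNN preset-layer outputs of sample malicious files into a library.

    Features are stored as tuples so they can be matched by value.
    """
    return {tuple(f) for f in sample_features}

def identify(target_feature, library):
    """Return 'malicious' if the feature is found in the library, else 'secure'."""
    return "malicious" if tuple(target_feature) in library else "secure"

# hypothetical preset-layer outputs of two sample malicious files
lib = build_feature_library([[0.1, 0.9], [0.7, 0.3]])
r1 = identify([0.1, 0.9], lib)
r2 = identify([0.5, 0.5], lib)
```

Compared with running the full DNN, this lookup is a single set-membership test, which matches the efficiency claim made later in the text.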
- the network device may further include: a communication interface 1203 and a communication bus 1204; wherein the processor 1201, the machine readable storage medium 1202, and the communication interface 1203 communicate with each other through the communication bus 1204.
- the communication interface 1203 is used for communication between the network device and other devices.
- when file identification is performed, the file to be identified is converted into image data, the features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
- the features of the image data objectively exist in the file to be identified and are not set according to experience; since the file recognition result is obtained from these objectively existing features, the dependence of file recognition on subjective human factors is reduced and the accuracy of file recognition is improved.
- the embodiment of the present application further provides a network device, as shown in FIG. 13, including a processor 1301 and a machine readable storage medium 1302, wherein the machine readable storage medium 1302 stores machine executable instructions that can be executed by the processor 1301.
- the processor 1301 is caused by the machine executable instructions to implement the file identification method illustrated in FIG. 7 above. Specifically, the processor 1301 is caused by the machine executable instructions to implement:
- the file identification model is configured to: determine a plurality of character strings corresponding to the input file according to the preset reading rule and the preset phrase model; construct a transfer matrix according to the plurality of character strings, wherein the elements in the transfer matrix correspond one-to-one to the string types; determine the target image data corresponding to the input file according to the elements in the transfer matrix; and extract features of the target image data and determine whether the input file is a malicious file according to the features of the target image data.
- the above-mentioned character string type is a type of a character string, and the types of character strings acquired are different depending on different reading rules and/or phrase models.
- the processor 1301 is caused by machine executable instructions to specifically implement:
- the input file is read according to a preset reading rule to obtain a plurality of characters; according to the preset phrase model, adjacent characters of the plurality of characters are combined to obtain a plurality of character strings.
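A minimal sketch of this step. The concrete "preset reading rule" (reading the content character by character) and "preset phrase model" (combining n adjacent characters into a character n-gram) are both assumptions, since the text leaves the rule and model unspecified:

```python
def extract_strings(data, n=2):
    """Read a file's content and combine adjacent characters into strings.

    data: the file content, read according to an assumed character-by-
    character reading rule.  Adjacent characters are combined n at a time,
    an assumed phrase model producing character n-grams.
    """
    chars = list(data)
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

strings = extract_strings("abca")
```

A different reading rule (e.g. reading opcodes or printable strings only) or phrase model would yield different string types, which is exactly why the text notes that the acquired string types depend on both.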
- the processor 1301 is caused by machine executable instructions to specifically implement:
- the processor 1301 is caused by machine executable instructions to specifically implement:
- the number of occurrences of the string is used as the value of the element corresponding to the string in the transfer matrix, to obtain the transfer matrix; or
- the sum of the number of occurrences of the string and a preset initial value is calculated, and the calculated sum is used as the value of the element corresponding to the string in the transfer matrix, to obtain the transfer matrix.
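A minimal sketch of building the transfer matrix from occurrence counts. Laying the matrix out with the head word selecting the row and the tail word the column is an assumption; the preset initial value acts like additive smoothing so that unseen strings get a nonzero element:

```python
from collections import Counter

def build_transfer_matrix(strings, initial=0):
    """Build a transfer matrix from character-string occurrence counts.

    Each two-character string maps to one matrix element (one-to-one with
    the string type).  The element value is the occurrence count, or the
    count plus the preset initial value when `initial` > 0.
    """
    counts = Counter(strings)
    heads = sorted({s[0] for s in strings})  # head words -> rows (assumed layout)
    tails = sorted({s[1] for s in strings})  # tail words -> columns
    return [[counts.get(h + t, 0) + initial for t in tails] for h in heads]

m = build_transfer_matrix(["ab", "ab", "ba"], initial=1)
```

With `initial=1`, the string "ab" (seen twice) maps to an element of value 3, while the unseen "aa" still gets 1 rather than 0, which keeps the later logarithm well defined.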
- the processor 1301 is caused by machine executable instructions to specifically implement:
- the color depth of the image cell corresponding to each element in the transfer matrix is calculated according to the value of each element in the transfer matrix, and the target image data corresponding to the input file is obtained.
- the processor 1301 is caused by machine executable instructions to specifically implement:
- the color depth of the image cell corresponding to the first element is determined according to the calculated ratio.
- the first character string is a character string corresponding to the first element in the transfer matrix.
- the processor 1301 is caused by machine executable instructions to specifically implement:
- the transition probability of the first element is determined according to the following formula: h = Log T; where h is the transition probability of the first element, and T is the calculated ratio.
- the calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
- the processor 1301 is caused by machine executable instructions to specifically implement:
- the CNN model is based on the classical CNN Lenet-5 model.
- the first convolutional layer includes 32 convolution kernels.
- the second convolutional layer includes 64 convolution kernels.
- a DropOut layer of 0.25 is added after the second pooling layer.
- a DropOut layer of 0.5 is added after the first fully connected layer.
- the processor 1301 is caused by machine executable instructions to specifically implement:
- the feature of the target image data is input into the pre-trained DNN model to obtain an output result; wherein the DNN model is used to identify the file by using the feature of the image data and determine whether the file corresponding to the image data is a malicious file, and the output result indicates whether the input file is a malicious file.
- the feature of the target image data is an output result of a preset layer of the CNN model
- the processor 1301 is caused by the machine executable instructions to specifically implement:
- the preset malicious file feature library includes: a feature of the image data corresponding to the plurality of sample malicious files;
- the features of the image data corresponding to the plurality of sample malicious files may be obtained by inputting, for each sample malicious file, the image data corresponding to the sample malicious file into the CNN model, and taking the output result of the preset layer of the CNN model as the feature of the corresponding image data.
- the network device may further include: a communication interface 1303 and a communication bus 1304; wherein the processor 1301, the machine readable storage medium 1302, and the communication interface 1303 communicate with each other through the communication bus 1304.
- the communication interface 1303 is used for communication between the network device and other devices.
- when file identification is performed, the file to be identified is converted into image data, the features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
- the features of the image data objectively exist in the file to be identified and are not set according to experience; since the file recognition result is obtained from these objectively existing features, the dependence of file recognition on subjective human factors is reduced and the accuracy of file recognition is improved.
- the embodiment of the present application further provides a network device, as shown in FIG. 14, including a processor 1401 and a machine readable storage medium 1402, wherein the machine readable storage medium 1402 stores machine executable instructions that can be executed by the processor 1401.
- the processor 1401 is caused by the machine executable instructions to implement the feature extraction method illustrated in FIG. 8 above. Specifically, the processor 1401 is caused by the machine executable instructions to implement:
- the plurality of sample files are respectively input into the file recognition model; wherein the file recognition model is configured to: determine a plurality of character strings corresponding to the input file according to the preset reading rule and the preset phrase model; construct a transfer matrix according to the plurality of character strings corresponding to the input file, wherein the elements in the transfer matrix correspond one-to-one to the string types; determine the target image data corresponding to the input file according to the elements in the transfer matrix; extract the features of the target image data corresponding to the input file by using the CNN model; and identify the features of the target image data by using the DNN model to determine whether the input file is a malicious file;
- the output of the preset layer of the CNN model is extracted as a feature of the sample file.
- the above-mentioned character string type is a type of a character string, and the types of character strings acquired are different depending on different reading rules and/or phrase models.
- identifying the features of the target image data by using the DNN model to determine whether the input file is a malicious file means that the DNN model identifies the input file by using the features of the image data and determines whether the input file is a malicious file.
- the processor 1401 is caused by machine executable instructions to specifically implement:
- the input file is read according to a preset reading rule to obtain a plurality of characters; according to the preset phrase model, adjacent characters of the plurality of characters are combined to obtain a plurality of character strings.
- the processor 1401 is caused by machine executable instructions to specifically implement:
- the processor 1401 is caused by machine executable instructions to specifically implement:
- the number of occurrences of the string is used as the value of the element corresponding to the string in the transfer matrix, to obtain the transfer matrix; or
- the sum of the number of occurrences of the string and a preset initial value is calculated, and the calculated sum is used as the value of the element corresponding to the string in the transfer matrix, to obtain the transfer matrix.
- the processor 1401 is caused by machine executable instructions to specifically implement:
- the color depth of the image cell corresponding to each element in the transfer matrix is calculated according to the value of each element in the transfer matrix, and the target image data corresponding to the input file is obtained.
- the processor 1401 is caused by machine executable instructions to specifically implement:
- the color depth of the image cell corresponding to the first element is determined according to the calculated ratio.
- the first character string is a character string corresponding to the first element in the transfer matrix.
- the processor 1401 is caused by machine executable instructions to specifically implement:
- the transition probability of the first element is determined according to the following formula: h = Log T; where h is the transition probability of the first element, and T is the calculated ratio.
- the calculated transition probability of the first element is determined as the color depth of the image cell corresponding to the first element.
- the CNN model is based on the classical CNN Lenet-5 model.
- the first convolutional layer includes 32 convolution kernels.
- the second convolutional layer includes 64 convolution kernels.
- a DropOut layer of 0.25 is added after the second pooling layer.
- a DropOut layer of 0.5 is added after the first fully connected layer.
- the sample file is a sample malicious file
- the processor 1401 is caused by the machine executable instructions to further implement: extracting, for each sample file, the output result of the preset layer of the CNN model as a feature of the sample file, and constructing a malicious file feature library according to the acquired plurality of features.
- the machine executable instructions may further include: a second identification instruction
- the processor 1401 is caused by the machine executable instructions to implement: inputting the file to be identified into the file recognition model; obtaining the output result of the preset layer of the CNN model in the file recognition model as the target feature; and searching the malicious file feature library for the target feature; if found, determining that the file to be identified is a malicious file; if not found, determining that the file to be identified is a secure file.
- the network device may further include: a communication interface 1403 and a communication bus 1404; wherein the processor 1401, the machine readable storage medium 1402, and the communication interface 1403 communicate with each other through the communication bus 1404.
- the communication interface 1403 is used for communication between the network device and other devices.
- the features output by the preset layer of the CNN model in the pre-trained file recognition model are extracted as file features, so that file features do not need to be obtained through manual analysis, which improves the efficiency of feature extraction and reduces labor cost.
- a malicious file feature library is constructed based on the features extracted from malicious files, and the file to be identified is identified based on the malicious file feature library. Since the features included in the malicious file feature library are extracted directly from malicious files, if the feature of the file to be identified matches a feature in the malicious file feature library, the file to be identified can be determined to be a malicious file, which improves the accuracy of file identification. In addition, matching a feature against the malicious file feature library requires far less computation than identifying the feature with the DNN model, which improves the efficiency of file identification.
- the communication bus may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus.
- the communication bus can be divided into an address bus, a data bus, a control bus, and the like.
- the machine readable storage medium may include a random access memory (RAM), and may also include a non-volatile memory (NVM), such as at least one disk storage. Additionally, the machine readable storage medium may be at least one storage device located remotely from the aforementioned processor.
- the processor may be a general-purpose processor, including a central processing unit (CPU) or a network processor (NP); it may also be a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
- the embodiment of the present application further provides a machine readable storage medium storing machine executable instructions.
- when being called and executed by a processor, the machine executable instructions cause the processor to implement the file identification method shown in FIG. 1 above. Specifically, the machine executable instructions cause the processor to implement:
- the above-mentioned character string type is a type of a character string, and the types of character strings acquired are different depending on different reading rules and/or phrase models.
- when file identification is performed, the file to be identified is converted into image data, the features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
- the features of the image data objectively exist in the file to be identified and are not set according to experience; since the file recognition result is obtained from these objectively existing features, the dependence of file recognition on subjective human factors is reduced and the accuracy of file recognition is improved.
- the embodiment of the present application further provides a machine readable storage medium storing machine executable instructions.
- when being called and executed by a processor, the machine executable instructions cause the processor to implement the file identification method shown in FIG. 7 above. Specifically, the machine executable instructions cause the processor to implement:
- the file identification model is configured to: determine a plurality of character strings corresponding to the input file according to the preset reading rule and the preset phrase model; construct a transfer matrix according to the plurality of character strings, wherein the elements in the transfer matrix correspond one-to-one to the string types; determine the target image data corresponding to the input file according to the elements in the transfer matrix; and extract features of the target image data and determine whether the input file is a malicious file according to the features of the target image data.
- the above-mentioned character string type is a type of a character string, and the types of character strings acquired are different depending on different reading rules and/or phrase models.
- when file identification is performed, the file to be identified is converted into image data, the features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
- the features of the image data objectively exist in the file to be identified and are not set according to experience; since the file recognition result is obtained from these objectively existing features, the dependence of file recognition on subjective human factors is reduced and the accuracy of file recognition is improved.
- the embodiment of the present application further provides a machine readable storage medium storing machine executable instructions.
- when being called and executed by a processor, the machine executable instructions cause the processor to implement the feature extraction method shown in FIG. 8 above. Specifically, the machine executable instructions cause the processor to implement:
- the plurality of sample files are respectively input into the file recognition model; wherein the file recognition model is configured to: determine a plurality of character strings corresponding to the input file according to the preset reading rule and the preset phrase model; construct a transfer matrix according to the plurality of character strings corresponding to the input file, wherein the elements in the transfer matrix correspond one-to-one to the string types; determine the target image data corresponding to the input file according to the elements in the transfer matrix; extract the features of the target image data corresponding to the input file by using the CNN model; and identify the features of the target image data by using the DNN model to determine whether the input file is a malicious file;
- the output of the preset layer of the CNN model is extracted as a feature of the sample file.
- the above-mentioned character string type is a type of a character string, and the types of character strings acquired are different depending on different reading rules and/or phrase models.
- the features output by the preset layer of the CNN model in the pre-trained file recognition model are extracted as file features, so that file features do not need to be obtained through manual analysis, which improves the efficiency of feature extraction and reduces labor cost.
- a malicious file feature library is constructed based on the features extracted from malicious files, and the file to be identified is identified based on the malicious file feature library. Since the features included in the malicious file feature library are extracted directly from malicious files, if the feature of the file to be identified matches a feature in the malicious file feature library, the file to be identified can be determined to be a malicious file, which improves the accuracy of file identification. In addition, matching a feature against the malicious file feature library requires far less computation than identifying the feature with the DNN model, which improves the efficiency of file identification.
- the embodiment of the present application further provides a machine executable instruction that, when called and executed by a processor, causes the processor to implement the file identification method shown in FIG. 1 above.
- machine executable instructions cause the processor to implement:
- the above-mentioned character string type is a type of a character string, and the types of character strings acquired are different depending on different reading rules and/or phrase models.
- when file identification is performed, the file to be identified is converted into image data, the features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
- the features of the image data objectively exist in the file to be identified and are not set according to experience; since the file recognition result is obtained from these objectively existing features, the dependence of file recognition on subjective human factors is reduced and the accuracy of file recognition is improved.
- the embodiment of the present application further provides a machine executable instruction that, when called and executed by a processor, causes the processor to implement the file identification method shown in FIG. 7 above.
- the machine executable instructions cause the processor to: acquire the file to be identified;
- the file identification model is configured to: determine a plurality of character strings corresponding to the input file according to the preset reading rule and the preset phrase model; construct a transfer matrix according to the plurality of character strings, wherein the elements in the transfer matrix correspond one-to-one to the string types; determine the target image data corresponding to the input file according to the elements in the transfer matrix; and extract features of the target image data and determine whether the input file is a malicious file according to the features of the target image data.
- the above-mentioned character string type is a type of a character string, and the types of character strings acquired are different depending on different reading rules and/or phrase models.
- when file identification is performed, the file to be identified is converted into image data, the features of the image data are extracted, and whether the file to be identified is a malicious file is determined according to the extracted features.
- the features of the image data objectively exist in the file to be identified and are not set according to experience; since the file recognition result is obtained from these objectively existing features, the dependence of file recognition on subjective human factors is reduced and the accuracy of file recognition is improved.
- the embodiment of the present application further provides a machine executable instruction that, when invoked and executed by a processor, causes the processor to implement the feature extraction method shown in FIG. 8 above.
- machine executable instructions cause the processor to implement:
- the plurality of sample files are respectively input into the file recognition model; wherein the file recognition model is configured to: determine a plurality of character strings corresponding to the input file according to the preset reading rule and the preset phrase model; construct a transfer matrix according to the plurality of character strings corresponding to the input file, wherein the elements in the transfer matrix correspond one-to-one to the string types; determine the target image data corresponding to the input file according to the elements in the transfer matrix; extract the features of the target image data corresponding to the input file by using the CNN model; and identify the features of the target image data by using the DNN model to determine whether the input file is a malicious file;
- the output of the preset layer of the CNN model is extracted as a feature of the sample file.
- the above-mentioned character string type is a type of a character string, and the types of character strings acquired are different depending on different reading rules and/or phrase models.
- the features output by the preset layer of the CNN model in the pre-trained file recognition model are extracted as file features, so that file features do not need to be obtained through manual analysis, which improves the efficiency of feature extraction and reduces labor cost.
- a malicious file feature library is constructed based on the features extracted from malicious files, and the file to be identified is identified based on the malicious file feature library. Since the features included in the malicious file feature library are extracted directly from malicious files, if the feature of the file to be identified matches a feature in the malicious file feature library, the file to be identified can be determined to be a malicious file, which improves the accuracy of file identification. In addition, matching a feature against the malicious file feature library requires far less computation than identifying the feature with the DNN model, which improves the efficiency of file identification.
Claims (28)
- 一种文件识别方法,所述方法包括:获取待识别文件;根据预设读取规则和预设词组模型,确定所述待识别文件对应的多个字符串;根据所述多个字符串,构建转移矩阵;其中,所述转移矩阵中的元素与字符串种类一一对应;根据所述转移矩阵中的元素,确定所述待识别文件对应的目标图像数据;提取所述目标图像数据的特征,并根据所述目标图像数据的特征,确定所述待识别文件是否为恶意文件。
- 根据权利要求1所述的方法,所述根据预设读取规则和预设词组模型,确定所述待识别文件对应的多个字符串,包括:按照预设读取规则读取所述待识别文件,得到多个字符;按照预设词组模型,组合所述多个字符中相邻的字符,得到多个字符串。
- 根据权利要求1所述的方法,所述根据所述多个字符串,构建转移矩阵,包括:确定每一字符串在所述多个字符串中的出现次数;根据每一字符串的出现次数,构建转移矩阵。
- 根据权利要求3所述的方法,所述根据每一字符串的出现次数,构建转移矩阵,包括:针对每一字符串,将该字符串的出现次数作为转移矩阵中该字符串对应的元素的值,得到所述转移矩阵;或者,针对每一字符串,计算该字符串的出现次数与预设初始值的和值,将计算得到的和值作为转移矩阵中该字符串对应的元素的值,得到所述转移矩阵。
- 根据权利要求1所述的方法,所述根据所述转移矩阵中的元素,确定所述待识别文件对应的目标图像数据,包括:根据所述转移矩阵中各元素的值,计算所述转移矩阵中各元素对应的图像单元格的颜色深度,得到所述待识别文件对应的目标图像数据。
- 根据权利要求5所述的方法,所述根据所述转移矩阵中各元素的值,计算所述转移矩阵中各元素对应的图像单元格的颜色深度,包括:针对所述转移矩阵中的第一元素,确定所述第一元素的值为第一数值;其中,所述第一元素为所述转移矩阵中的任一元素,所述第一元素的值根据所述第一元素对应的第一字符串的出现次数确定;确定所有第二元素的值之和为第二数值;其中,所述第二元素的值根据第二字符串的出现次数确定,所述第二字符串的头部词与所述第一字符串的头部词相同;计算所述第一数值与所述第二数值的比值;根据计算得到的比值,确定所述第一元素对应的图像单元格的颜色深度。
- 根据权利要求6所述的方法,所述根据计算得到的比值,确定所述第一元素对应的图像单元格的颜色深度,包括:针对所述第一元素,根据以下公式确定所述第一元素的转移概率:h=Log T;其中,h为所述第一元素的转移概率,T为计算得到的比值;将计算得到的所述第一元素的转移概率,确定为所述第一元素对应的图像单元格的颜色深度。
- 根据权利要求1所述的方法,所述提取所述目标图像数据的特征,包括:将所述目标图像数据输入预先训练的卷积神经网络CNN模型,得到所述目标图像数据的特征;其中,所述CNN模型以经典CNN Lenet-5模型为基础,第一个卷积层包括32个卷积核,第二个卷积层包括64个卷积核,第二个池化层后面增加0.25的丢弃DropOut层,第一个全连接层后面增加0.5的DropOut层。
- 根据权利要求8所述的方法,所述根据所述目标图像数据的特征,确定所述待识别文件是否为恶意文件,包括:将所述目标图像数据的特征输入预先训练的深度神经网络DNN模型,得到输出结果;其中,所述DNN模型用于对图像数据的特征进行识别,确定图像数据对应的文件是否为恶意文件,所述输出结果指示所述待识别文件是否为恶意文件。
- 根据权利要求8所述的方法,所述目标图像数据的特征为所述CNN模型的预设层的输出结果;所述根据所述目标图像数据的特征,确定所述待识别文件是否为恶意文件,包括:从预设恶意文件特征库中查找所述目标图像数据的特征;所述预设恶意文件特征库包括:多个样本恶意文件对应的图像数据的特征;若查找到,则确定所述待识别文件为恶意文件;若未查找到,则确定所述待识别文件为安全文件。
- 一种文件识别方法,所述方法包括:获取待识别文件;将所述待识别文件输入预先训练的文件识别模型,确定所述待识别文件是否为恶意文件;其中,所述文件识别模型用于:根据预设读取规则和预设词组模型,确定输入文件对应的多个字符串;根据所述多个字符串,构建转移矩阵,所述转移矩阵中的元素与字符串种类一一对应;根据所述转移矩阵中的元素,确定所述输入文件对应的目标图像数据;提取所述目标图像数据的特征,并根据所述目标图像数据的特征,确定所述输入文件是否为恶意文件。
- 一种特征提取方法,所述方法包括:将多个样本文件分别输入文件识别模型;其中,所述文件识别模型用于:根据预设读取规则和预设词组模型,确定输入文件对应的多个字符串;根据所述多个字符串,构建转移矩阵,所述转移矩阵中的元素与字符串种类一一对应;根据所述转移矩阵中的元素,确定所述输入文件对应的目标图像数据;利用卷积神经网络CNN模型提取所述目标图像数据的特征;利用深度神经网络DNN模型对所述目标图像数据的特征进行识别,确定所述输入文件是否为恶意文件;针对每一样本文件,提取所述CNN模型的预设层的输出结果,作为该样本文件的特征。
- 根据权利要求12所述的方法,所述样本文件为样本恶意文件;在针对每一样本文件,提取所述CNN模型的预设层的输出结果,作为该样本文件的特征之后,还包括:根据提取的多个特征,构建恶意文件特征库。
- 根据权利要求13所述的方法,还包括:将待识别文件输入所述文件识别模型;获取所述CNN模型的预设层的输出结果,作为目标特征;从所述恶意文件特征库中查找所述目标特征;若查找到,则确定所述待识别文件为恶意文件;若未查找到,则确定所述待识别文件为安全文件。
- 一种网络设备,包括处理器和机器可读存储介质,所述机器可读存储介质存储有能够被所述处理器执行的机器可执行指令,所述处理器被所述机器可执行指令促使实现:获取待识别文件;根据预设读取规则和预设词组模型,确定所述待识别文件对应的多个字符串;根据所述多个字符串,构建转移矩阵;其中,所述转移矩阵中的元素与字符串种类一一对应;根据所述转移矩阵中的元素,确定所述待识别文件对应的目标图像数据;提取所述目标图像数据的特征,并根据所述目标图像数据的特征,确定所述待识别文件是否为恶意文件。
- 根据权利要求15所述的网络设备,所述处理器被所述机器可执行指令促使具体实现:按照预设读取规则读取所述待识别文件,得到多个字符;按照预设词组模型,组合所述多个字符中相邻的字符,得到多个字符串。
- 根据权利要求15所述的网络设备,所述处理器被所述机器可执行指令促使具体实现:确定每一字符串在所述多个字符串中的出现次数;根据每一字符串的出现次数,构建转移矩阵。
- 根据权利要求17所述的网络设备,所述处理器被所述机器可执行指令促使具体实现:针对每一字符串,将该字符串的出现次数作为转移矩阵中该字符串对应的元素的值,得到所述转移矩阵;或者,针对每一字符串,计算该字符串的出现次数与预设初始值的和值,将计算得到的和值作为转移矩阵中该字符串对应的元素的值,得到所述转移矩阵。
- 根据权利要求15所述的网络设备,所述处理器被所述机器可执行指令促使具体实现:根据所述转移矩阵中各元素的值,计算所述转移矩阵中各元素对应的图像单元格的颜色深度,得到所述待识别文件对应的目标图像数据。
- 根据权利要求19所述的网络设备,所述处理器被所述机器可执行指令促使具体实现:针对所述转移矩阵中的第一元素,确定所述第一元素的值为第一数值;其中,所述第一元素为所述转移矩阵中的任一元素,所述第一元素的值根据所述第一元素对应的第一字符串的出现次数确定;确定所有第二元素的值之和为第二数值;其中,所述第二元素的值根据第二字符串的出现次数确定,所述第二字符串的头部词与所述第一字符串的头部词相同;计算所述第一数值与所述第二数值的比值;根据计算得到的比值,确定所述第一元素对应的图像单元格的颜色深度。
- 根据权利要求20所述的网络设备,所述处理器被所述机器可执行指令促使具体实现:针对所述第一元素,根据以下公式确定所述第一元素的转移概率:h=Log T;其中,h为所述第一元素的转移概率,T为计算得到的比值;将计算得到的所述第一元素的转移概率,确定为所述第一元素对应的图像单元格的颜色深度。
- 根据权利要求15所述的网络设备,所述处理器被所述机器可执行指令促使具体实现:将所述目标图像数据输入预先训练的卷积神经网络CNN模型,得到所述目标图像数据的特征;其中,所述CNN模型以经典CNN Lenet-5模型为基础,第一个卷积层包括32个卷积核,第二个卷积层包括64个卷积核,第二个池化层后面增加0.25的丢弃DropOut层,第一个全连接层后面增加0.5的DropOut层。
- 根据权利要求22所述的网络设备,所述处理器被所述机器可执行指令促使具体实现:将所述目标图像数据的特征输入预先训练的深度神经网络DNN模型,得到输出结果;其中,所述DNN模型用于对图像数据的特征进行识别,确定图像数据对应的文件是否为恶意文件,所述输出结果指示所述待识别文件是否为恶意文件。
- 根据权利要求22所述的网络设备,所述目标图像数据的特征为所述CNN模型的预设层的输出结果;所述处理器被所述机器可执行指令促使具体实现:所述根据所述目标图像数据的特征,确定所述待识别文件是否为恶意文件,包括:从预设恶意文件特征库中查找所述目标图像数据的特征;所述预设恶意文件特征库包括:多个样本恶意文件对应的图像数据的特征;若查找到,则确定所述待识别文件为恶意文件;若未查找到,则确定所述待识别文件为安全文件。
- A network device, including a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor, the machine-executable instructions causing the processor to: obtain a file to be identified; and input the file to be identified into a pre-trained file identification model to determine whether the file to be identified is a malicious file; where the file identification model is used to: determine, according to a preset reading rule and a preset phrase model, a plurality of character strings corresponding to an input file; construct a transfer matrix according to the plurality of character strings, the elements in the transfer matrix corresponding one-to-one to the character-string types; determine, according to the elements in the transfer matrix, the target image data corresponding to the input file; and extract features of the target image data and determine, according to the features of the target image data, whether the input file is a malicious file.
- A network device, including a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor, the machine-executable instructions causing the processor to: input a plurality of sample files into a file identification model, where the file identification model is used to: determine, according to a preset reading rule and a preset phrase model, a plurality of character strings corresponding to an input file; construct a transfer matrix according to the plurality of character strings, the elements in the transfer matrix corresponding one-to-one to the character-string types; determine, according to the elements in the transfer matrix, the target image data corresponding to the input file; and extract features of the target image data using a convolutional neural network (CNN) model and identify the features of the target image data using a deep neural network (DNN) model, to determine whether the input file is a malicious file; and, for each sample file, extract the output result of a preset layer of the CNN model as the feature of that sample file.
- The network device according to claim 26, wherein the sample files are sample malicious files, and the machine-executable instructions cause the processor to: after extracting, for each sample file, the output result of the preset layer of the CNN model as the feature of that sample file, construct a malicious-file feature library according to the plurality of extracted features.
- The network device according to claim 26, wherein the machine-executable instructions cause the processor to: input a file to be identified into the file identification model; obtain the output result of the preset layer of the CNN model as a target feature; search the malicious-file feature library for the target feature; if found, determine that the file to be identified is a malicious file; and if not found, determine that the file to be identified is a safe file.
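The claims above (preset reading rule → preset phrase model → transfer matrix → color depth) amount to rendering a file's adjacent-word (bigram) transition statistics as a grayscale image. Below is a minimal Python sketch of that pipeline; the byte-level reading rule, the 16-value alphabet (and hence the 16×16 matrix size), and the smoothing constant of 1 are illustrative assumptions, not values specified by the patent.

```python
import math

# Illustrative parameters -- the patent does not fix these values:
ALPHABET = 16   # assumed number of distinct words; yields a 16x16 transfer matrix
SMOOTHING = 1   # the claims' optional "preset initial value" added to every count

def read_words(data: bytes):
    """Preset reading rule (assumed here: one byte per character,
    bucketed into ALPHABET values)."""
    return [b % ALPHABET for b in data]

def build_transfer_matrix(words):
    """Preset phrase model: combine adjacent words into bigrams and count
    their occurrences. Element (i, j) corresponds to the bigram with head
    word i and tail word j; counts start from the preset initial value."""
    m = [[SMOOTHING] * ALPHABET for _ in range(ALPHABET)]
    for head, tail in zip(words, words[1:]):
        m[head][tail] += 1
    return m

def to_image(matrix):
    """Per claims 19-21: T is an element's value divided by the summed values
    of all elements sharing its head word (i.e. its matrix row), and the
    color depth of the element's image cell is the transition probability
    h = log T."""
    image = []
    for row in matrix:
        total = sum(row)
        image.append([math.log(v / total) for v in row])
    return image

words = read_words(b"example file contents to be identified")
image = to_image(build_transfer_matrix(words))
```

The resulting ALPHABET × ALPHABET grid of color depths plays the role of the target image data that the later claims feed into the LeNet-5-based CNN; the CNN feature-extraction and DNN classification stages are omitted from this sketch.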
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810349458.6 | 2018-04-18 | ||
CN201810349458.6A CN109753987B (zh) | 2018-04-18 | 2018-04-18 | File identification method and feature extraction method |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019201295A1 true WO2019201295A1 (zh) | 2019-10-24 |
Family
ID=66402373
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/083200 WO2019201295A1 (zh) | 2018-04-18 | 2019-04-18 | File identification method and feature extraction method |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109753987B (zh) |
WO (1) | WO2019201295A1 (zh) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110516125B (zh) * | 2019-08-28 | 2020-05-08 | Rajax Network Technology (Shanghai) Co., Ltd. | Method, apparatus and device for identifying abnormal character strings, and readable storage medium |
CN111222856A (zh) * | 2020-01-15 | 2020-06-02 | Sangfor Technologies Inc. | Mail identification method, apparatus, device and storage medium |
CN111310205B (zh) * | 2020-02-11 | 2024-05-10 | Ping An Technology (Shenzhen) Co., Ltd. | Sensitive information detection method and apparatus, computer device, and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101046858A (zh) * | 2006-03-29 | 2007-10-03 | Tencent Technology (Shenzhen) Co., Ltd. | Electronic information comparison system and method, and anti-spam system |
CN104216875A (zh) * | 2014-09-26 | 2014-12-17 | Institute of Automation, Chinese Academy of Sciences | Automatic microblog text summarization method based on unsupervised key bigram string extraction |
CN107392019A (zh) * | 2017-07-05 | 2017-11-24 | Beijing Jinjing Yunhua Technology Co., Ltd. | Method and apparatus for training and detecting malicious code families |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8499354B1 (en) * | 2011-03-15 | 2013-07-30 | Symantec Corporation | Preventing malware from abusing application data |
CN104751055B (zh) * | 2013-12-31 | 2017-11-03 | Beijing Venustech Information Security Technology Co., Ltd. | Texture-based distributed malicious code detection method, apparatus and system |
CN105095755A (zh) * | 2015-06-15 | 2015-11-25 | Anyi Hengtong (Beijing) Technology Co., Ltd. | File identification method and apparatus |
- 2018
  - 2018-04-18 CN CN201810349458.6A patent/CN109753987B/zh active Active
- 2019
  - 2019-04-18 WO PCT/CN2019/083200 patent/WO2019201295A1/zh active Application Filing
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111079528A (zh) * | 2019-11-07 | 2020-04-28 | Electric Power Research Institute of State Grid Liaoning Electric Power Co., Ltd. | Deep-learning-based method and system for checking graphical-primitive drawings |
CN111582282A (zh) * | 2020-05-13 | 2020-08-25 | iFLYTEK Co., Ltd. | Text recognition method, apparatus, device and storage medium |
CN111582282B (zh) * | 2020-05-13 | 2024-04-12 | iFLYTEK Co., Ltd. | Text recognition method, apparatus, device and storage medium |
CN113949582A (zh) * | 2021-10-25 | 2022-01-18 | NSFOCUS Technologies Group Co., Ltd. | Network asset identification method and apparatus, electronic device, and storage medium |
CN113949582B (zh) * | 2021-10-25 | 2023-05-30 | NSFOCUS Technologies Group Co., Ltd. | Network asset identification method and apparatus, electronic device, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109753987B (zh) | 2021-08-06 |
CN109753987A (zh) | 2019-05-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019201295A1 (zh) | File identification method and feature extraction method | |
WO2020221298A1 (zh) | Text detection model training method, and text region and content determination method and apparatus | |
TWI682325B (zh) | Identification system and identification method | |
JP6541673B2 (ja) | Real-time speech evaluation system and method on a mobile device | |
WO2020073664A1 (zh) | Coreference resolution method, electronic apparatus and computer-readable storage medium | |
CN108647736B (zh) | Image classification method based on perceptual loss and matching attention mechanism | |
US20110314294A1 (en) | Password checking | |
CN104156349B (zh) | System and method for out-of-vocabulary word discovery and word segmentation based on a statistical dictionary model | |
JP2020505643A (ja) | Speech recognition method, electronic device, and computer storage medium | |
WO2021208727A1 (zh) | Artificial-intelligence-based text error detection method and apparatus, and computer device | |
WO2021031825A1 (zh) | Network fraud identification method and apparatus, computer apparatus, and storage medium | |
CN111835763B (zh) | DNS tunnel traffic detection method and apparatus, and electronic device | |
CN115380284A (zh) | Unstructured text classification | |
CN114050912B (zh) | Malicious domain name detection method and apparatus based on deep reinforcement learning | |
WO2019238125A1 (zh) | Information processing method, related device, and computer storage medium | |
CN116956835B (zh) | Document generation method based on a pre-trained language model | |
WO2019201024A1 (zh) | Method, apparatus, device and storage medium for updating model parameters | |
WO2021082861A1 (zh) | Scoring method and apparatus, electronic device, and storage medium | |
US11227110B1 (en) | Transliteration of text entry across scripts | |
JP7149976B2 (ja) | Error correction method and apparatus, and computer-readable medium | |
US20160063336A1 (en) | Generating Weights for Biometric Tokens in Probabilistic Matching Systems | |
WO2021253938A1 (zh) | Neural network training method, and video recognition method and apparatus | |
CN114826681A (zh) | DGA domain name detection method, system, medium, device and terminal | |
CN113239683A (zh) | Chinese text error correction method, system and medium | |
CN115858776B (zh) | Variant text classification and recognition method, system, storage medium and electronic device | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19787885 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19787885 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 04.05.2021) |
|