CN110879888A - Virus file detection method, device and equipment - Google Patents

Virus file detection method, device and equipment

Info

Publication number
CN110879888A
Authority
CN
China
Prior art keywords
virus
file
detected
word segmentation
classifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911122399.XA
Other languages
Chinese (zh)
Inventor
王春磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Big Data Technologies Co Ltd
Original Assignee
New H3C Big Data Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New H3C Big Data Technologies Co Ltd filed Critical New H3C Big Data Technologies Co Ltd
Priority to CN201911122399.XA priority Critical patent/CN110879888A/en
Publication of CN110879888A publication Critical patent/CN110879888A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/561Virus type analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Security & Cryptography (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Virology (AREA)
  • Computer Hardware Design (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a virus file detection method, a virus file detection device and virus file detection equipment. The method comprises the following steps: performing word segmentation processing on a character string represented by a file to be detected to obtain word segmentation characteristics of the file to be detected and a characteristic matrix of the file to be detected; for each element in the feature matrix, converting the value of the element into a gray value to obtain a gray image to be detected corresponding to the file to be detected; inputting the gray-scale image to be detected into a trained virus classifier; and determining whether the file to be detected is a virus file or not according to the classification result output by the virus classifier. Therefore, file identification is converted into image identification, and virus files which are slightly changed or upgraded are identified by utilizing the characteristic of high identification accuracy of the classifier, so that the omission probability of the virus files is reduced.

Description

Virus file detection method, device and equipment
Technical Field
The present application relates to the field of network communication technologies, and in particular, to a method, an apparatus, and a device for detecting a virus file.
Background
Virus files can cause significant harm to computers, for example by illegally acquiring computer privileges, illegally accessing private computers, illegally controlling computer resources, or hijacking user assets. To defend against virus files, they must first be identified.
At present, there are two main methods for detecting virus files:
Method one: partial text or character strings are extracted from virus samples as feature codes, and the feature codes are stored in a virus library. When a file to be detected is received, its feature codes are extracted in the same way and compared with the feature codes in the virus library. If a matching feature code exists, the file to be detected is determined to be a virus file.
Method two: a hash operation is performed on each virus sample, and the resulting hash value is stored in a virus library. When a file to be detected is received, the same hash operation is performed on it, and the resulting hash value is compared with the hash values in the virus library. If a matching hash value exists, the file to be detected is determined to be a virus file.
However, both methods can identify a virus file only when the file to be detected is completely consistent with a virus sample. If the file to be detected has been slightly modified or upgraded on the basis of a known virus sample, the existing detection methods cannot identify it as a virus file, so the detection is missed.
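The limitation of exact hash matching can be illustrated with a short sketch. The sample bytes and the choice of SHA-256 are assumptions made for illustration only; the application does not name a hash algorithm.

```python
import hashlib


def file_hash(data: bytes) -> str:
    """Return the SHA-256 digest of the file contents as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical known virus sample and a variant that differs in one byte.
known_sample = bytes.fromhex("8A345D673AB3043D4A220D5F")
variant = bytes.fromhex("8A345D673AB3043D4A220D5E")

# The "virus library" of the second prior-art method: a set of hash values.
virus_library = {file_hash(known_sample)}


def is_known_virus(data: bytes) -> bool:
    """Exact-match lookup of the file's hash value in the virus library."""
    return file_hash(data) in virus_library


print(is_known_virus(known_sample))  # the unmodified sample is found
print(is_known_virus(variant))       # the one-byte variant is missed
```

Because any change to the input changes the digest, the variant is not found in the library even though it is almost identical to the known sample.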
Summary of the Application
In view of this, the present application provides a method, an apparatus, and a device for detecting a virus file, so as to reduce the probability of missing detection of the virus file.
In order to achieve the purpose of the application, the application provides the following technical scheme:
in a first aspect, the present application provides a method for detecting a virus file, the method comprising:
performing word segmentation processing on a character string represented by a file to be detected to obtain word segmentation characteristics of the file to be detected and a characteristic matrix of the file to be detected;
for each element in the feature matrix, converting the value of the element into a gray value to obtain a gray image to be detected corresponding to the file to be detected;
inputting the gray-scale image to be detected into a trained virus classifier;
and determining whether the file to be detected is a virus file or not according to the classification result output by the virus classifier.
Optionally, before inputting the grayscale image to be detected into the trained virus classifier, the method further includes:
dividing a virus sample set into a training sample set and a testing sample set, wherein the virus sample set comprises a plurality of known virus samples;
training the deep learning model by using the virus samples in the training sample set to obtain a virus classifier;
verifying the classification accuracy of the virus classifier by using the virus samples in the test sample set for the virus classifier obtained by training;
and if the classification accuracy reaches a preset accuracy threshold, determining that the virus classifier is trained.
Optionally, the method further includes:
and if the classification accuracy rate does not reach the preset accuracy rate threshold value, selecting a part of virus samples from the test sample set to continue training the deep learning model until the classification accuracy rate of the trained virus classifier reaches the preset accuracy rate threshold value.
Optionally, performing word segmentation processing on the character string represented by the file to be detected to obtain the word segmentation features of the file to be detected and the feature matrix of the file to be detected includes:
dividing the character string represented by the file to be detected into N word segmentation features according to the principle that a preset number of characters are divided into one word segmentation feature, and two adjacent word segmentation features in the character string do not comprise characters in the same position, wherein N is a positive integer;
and constructing a feature matrix of the file to be detected based on the N word segmentation features.
Optionally, the converting the value of the element into a gray value includes:
and based on the gray value range, carrying out normalization processing on the values of the elements to obtain corresponding gray values.
In a second aspect, the present application provides a virus file detection apparatus, the apparatus comprising:
the word segmentation unit is used for performing word segmentation processing on the character string represented by the file to be detected to obtain word segmentation characteristics of the file to be detected and a characteristic matrix of the file to be detected;
the conversion unit is used for converting the value of each element in the characteristic matrix into a gray value to obtain a to-be-detected gray map corresponding to the to-be-detected file;
the input unit is used for inputting the gray-scale image to be detected into the trained virus classifier;
and the first determining unit is used for determining whether the file to be detected is a virus file or not according to the classification result output by the virus classifier.
Optionally, the apparatus further comprises:
the dividing unit is used for dividing a virus sample set into a training sample set and a testing sample set, wherein the virus sample set comprises a plurality of known virus samples;
the training unit is used for training the deep learning model by using the virus samples in the training sample set to obtain a virus classifier;
the verification unit is used for verifying the classification accuracy of the virus classifier by using the virus samples in the test sample set for the virus classifier obtained by training;
and the second determining unit is used for determining that the virus classifier is trained if the classification accuracy reaches a preset accuracy threshold.
Optionally, the training unit is further configured to select a part of the virus samples from the test sample set to continue training the deep learning model if the classification accuracy does not reach a preset accuracy threshold, until the classification accuracy of the trained virus classifier reaches the preset accuracy threshold.
Optionally, the word segmentation unit performing word segmentation processing on the character string represented by the file to be detected to obtain the word segmentation features of the file to be detected and the feature matrix of the file to be detected includes:
dividing the character string represented by the file to be detected into N word segmentation features according to the principle that a preset number of characters are divided into one word segmentation feature, and two adjacent word segmentation features in the character string do not comprise characters in the same position, wherein N is a positive integer;
and constructing a feature matrix of the file to be detected based on the N word segmentation features.
Optionally, the conversion unit converting the values of the elements into gray values includes:
and based on the gray value range, carrying out normalization processing on the values of the elements to obtain corresponding gray values.
In a third aspect, the present application provides an apparatus comprising a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor, the processor being caused by the machine-executable instructions to implement the virus file detection method described above.
In a fourth aspect, the present application provides a machine-readable storage medium having stored therein machine-executable instructions, which when executed by a processor, implement the above-mentioned virus file detection method.
From the above description, it can be seen that in the application, the file identification is converted into the image identification, and the virus file slightly changed or upgraded can be identified by utilizing the characteristic of high identification accuracy of the classifier, so that the omission probability of the virus file is reduced.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating a method for detecting a virus file according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating an implementation of training a virus classifier according to an embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating an implementation of step 101 according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a virus file detection apparatus according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a hardware structure of an apparatus according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present application. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the embodiments of the present application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the embodiments of the present application. The word "if" as used herein may be interpreted as "at the time of", "when", or "in response to a determination", depending on the context.
The embodiment of the application provides a virus file detection method. In the method, file identification is converted into image identification, and virus files which are slightly changed or upgraded can be identified by utilizing the characteristic of high identification accuracy of the classifier, so that the omission probability of the virus files is reduced.
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the embodiments of the present application are described in detail below with reference to the accompanying drawings and specific embodiments:
Referring to FIG. 1, a flowchart of a virus file detection method according to an embodiment of the present application is shown. The process can be applied to any device that needs virus defense, such as a personal computer or a server. The application is not limited to a particular type of device.
As shown in fig. 1, the process may include the following steps:
Step 101: performing word segmentation processing on the character string represented by the file to be detected to obtain the word segmentation features of the file to be detected and the feature matrix of the file to be detected.
In order to protect against virus files, virus detection needs to be performed on the received files. Here, the received file is referred to as a file to be detected.
A file to be detected can usually be represented as a character string, for example the file "8A345D673AB3043D4A220D5F" represented as a hexadecimal string.
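As a minimal sketch of how a file's bytes might be turned into such a hexadecimal string, the following uses Python's built-in `bytes.hex()`. The function name `to_hex_string` and the in-memory sample bytes are assumptions for illustration; the application does not describe how the string is obtained.

```python
def to_hex_string(data: bytes) -> str:
    """Represent raw file contents as an upper-case hexadecimal string."""
    return data.hex().upper()


# Hypothetical file contents (12 bytes) matching the example string in the text.
sample = bytes([0x8A, 0x34, 0x5D, 0x67, 0x3A, 0xB3,
                0x04, 0x3D, 0x4A, 0x22, 0x0D, 0x5F])

hex_string = to_hex_string(sample)
print(hex_string)  # 8A345D673AB3043D4A220D5F
```

For a file on disk, the bytes could be obtained with `open(path, "rb").read()` before the conversion.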
In this step, word segmentation processing is performed on the character string represented by the file to be detected to obtain the word segmentation features of the file to be detected and the feature matrix of the file to be detected. In one example, the elements in the feature matrix of the file to be detected are the word segmentation features of the file to be detected.
The specific word segmentation process involved in this step is described later.
Step 102: converting the value of each element in the feature matrix into a gray value to obtain a gray-scale image to be detected corresponding to the file to be detected.
Here, the elements of the feature matrix are the word segmentation features obtained after the word segmentation processing in step 101.
In the step, the value of each element is converted into a corresponding gray value, and a gray map corresponding to the file to be detected is obtained. Here, the gray scale image corresponding to the document to be detected is referred to as a gray scale image to be detected.
The process of converting the value of an element into a gray value in this step is described later.
Step 103: inputting the gray-scale image to be detected into the trained virus classifier.
The virus classifier is a classifier trained on known virus samples. The process of training the virus classifier is described later.
In this step, the gray-scale image to be detected obtained in step 102 is input into the trained virus classifier, and the classification result output by the virus classifier can be obtained.
Step 104: determining whether the file to be detected is a virus file according to the classification result output by the virus classifier.
The classification result output by the virus classifier can directly indicate whether the file to be detected is a virus file or not. For example, when the classification result is a first value, the file to be detected is represented as a virus file; and when the classification result is a second value, the file to be detected is not a virus file.
Thus, the flow shown in fig. 1 is completed.
As can be seen from the flow shown in fig. 1, in the embodiment of the present application, file identification is converted into image identification, and a virus file that is slightly changed or upgraded can be identified by using the characteristic of high identification accuracy of a classifier, so that the probability of missed detection of the virus file is reduced.
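The flow of steps 101 to 104 can be sketched end to end as follows. This is an illustrative outline under stated assumptions, not the implementation of the application: `placeholder_classifier` is an arbitrary stand-in for the trained virus classifier, the values 1 and 0 for the first and second classification values are hypothetical, dropping an incomplete trailing piece and rounding to the nearest integer are assumptions, and the 3 × 3 matrix size follows the examples given later in the description.

```python
from typing import Callable, List

GrayMap = List[List[int]]

VIRUS = 1   # hypothetical "first value" of the classification result
NORMAL = 0  # hypothetical "second value"


def segment(hex_string: str, size: int) -> List[str]:
    """Step 101: split into non-overlapping features of `size` characters.

    An incomplete trailing piece is dropped (an assumption).
    """
    return [hex_string[i:i + size]
            for i in range(0, len(hex_string) - size + 1, size)]


def to_gray_map(features: List[str], rows: int, cols: int, size: int) -> GrayMap:
    """Steps 101-102: build a rows x cols matrix and map each element to 0-255."""
    needed = rows * cols
    cells = features[:needed] + ["0"] * max(0, needed - len(features))
    h = 16 ** size  # maximum element value + 1
    grays = [min(255, round(int(cell, 16) / h * 256)) for cell in cells]
    return [grays[r * cols:(r + 1) * cols] for r in range(rows)]


def detect(hex_string: str, classifier: Callable[[GrayMap], int],
           rows: int = 3, cols: int = 3, size: int = 2) -> bool:
    """Steps 103-104: classify the gray map and interpret the result."""
    gray_map = to_gray_map(segment(hex_string, size), rows, cols, size)
    return classifier(gray_map) == VIRUS


def placeholder_classifier(gray_map: GrayMap) -> int:
    """Stand-in for the trained classifier; the rule below is arbitrary."""
    mean = sum(map(sum, gray_map)) / sum(len(row) for row in gray_map)
    return VIRUS if mean > 64 else NORMAL


print(detect("8A345D673AB3043D4A220D5F", placeholder_classifier))
```

Passing the classifier as a callable keeps the preprocessing independent of whichever trained model is eventually used.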
The process of training the virus classifier is described below. Referring to fig. 2, an implementation process of training a virus classifier is shown in an embodiment of the present application.
As shown in fig. 2, the process may include the following steps:
Step 201: dividing a virus sample set into a training sample set and a testing sample set.
Here, the virus sample set includes a plurality of known virus samples.
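A minimal sketch of such a division is shown below. The 80/20 ratio, the fixed random seed, and the sample names are assumptions; the application does not specify how the set is divided.

```python
import random
from typing import List, Sequence, Tuple


def split_samples(samples: Sequence[str], train_ratio: float = 0.8,
                  seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle the virus sample set and divide it into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical identifiers standing in for known virus samples.
virus_samples = [f"sample_{i}" for i in range(10)]
train_set, test_set = split_samples(virus_samples)
print(len(train_set), len(test_set))  # 8 2
```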
Step 202: training the deep learning model by using the virus samples in the training sample set to obtain a virus classifier.
Deep learning is a complex form of machine learning that performs well in speech and image recognition. A typical deep learning model is the Convolutional Neural Network (CNN) model.
As an embodiment, the virus classifier can be obtained by training a convolutional neural network model by using virus samples in a training sample set.
Specifically, each virus sample in the training sample set is converted into a corresponding virus sample gray-scale map. For the conversion process, refer to the foregoing process (steps 101 and 102) of converting the file to be detected into the gray-scale image to be detected; details are not repeated here.
The convolutional neural network model is then configured. For example, the model is set to include two convolutional layers with 32 and 64 convolution kernels, respectively, and the ReLU activation function is used.
The gray-scale map of each converted virus sample is input into the configured convolutional neural network model. Within the model, the gray-scale maps are first processed by the convolutional layers to obtain virus sample information sequences, which are then input into a plurality of LSTM (Long Short-Term Memory) units to obtain virus features. The extracted virus features are input into a mean pooling layer for smoothing, the smoothed features are input into a Dropout layer for dimension reduction, and the result is finally passed to a Sigmoid function, which outputs the classification result. The internal processing of the convolutional neural network model is prior art and is not described in detail here.
And training the convolutional neural network model by using all virus samples in the training sample set to obtain the virus classifier.
Step 203: verifying the classification accuracy of the trained virus classifier by using the virus samples in the test sample set.
After the virus classifier is obtained through step 202, the classification accuracy of the virus classifier needs to be verified.
Therefore, in the embodiment of the application, each virus sample in the test sample set is input into the trained virus classifier, and whether the virus classifier is classified correctly is determined according to the classification result output by the virus classifier.
For example, a virus file is input into a virus classifier, and if the classification result output by the virus classifier is a virus file, the classification is determined to be correct; and if the output classification result is a normal file, determining that the classification is wrong.
The embodiment of the application counts the number of the correctly classified virus samples, and takes the ratio of the number of the correctly classified virus samples to the total number of the virus samples in the test sample set as the classification accuracy of the virus classifier.
Step 204: if the classification accuracy reaches a preset accuracy threshold, determining that the virus classifier is trained and meets the use requirement.
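The accuracy calculation of step 203 and the threshold check of step 204 can be sketched as follows. The threshold value 0.95, the test strings, and `stub_classifier` (an arbitrary rule standing in for the trained classifier) are all assumptions for illustration.

```python
from typing import Callable, Sequence

VIRUS = 1  # hypothetical classification value meaning "virus file"


def classification_accuracy(classifier: Callable[[str], int],
                            test_samples: Sequence[str]) -> float:
    """Ratio of correctly classified virus samples to all test samples.

    Every sample in the test set is a known virus, so a result of VIRUS
    counts as correct.
    """
    if not test_samples:
        raise ValueError("test sample set is empty")
    correct = sum(1 for sample in test_samples if classifier(sample) == VIRUS)
    return correct / len(test_samples)


ACCURACY_THRESHOLD = 0.95  # hypothetical preset accuracy threshold


def stub_classifier(sample: str) -> int:
    """Arbitrary stand-in: flags any sample whose hex string contains '3D'."""
    return VIRUS if "3D" in sample else 0


test_set = ["8A343D", "3D4A22", "0D5F00", "673D04"]
accuracy = classification_accuracy(stub_classifier, test_set)
training_complete = accuracy >= ACCURACY_THRESHOLD
print(accuracy, training_complete)  # 0.75 False
```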
Thus, the flow shown in fig. 2 is completed.
As can be seen from the flow shown in fig. 2, in the embodiment of the present application, the virus classifier is trained by using the training sample set, and the classification accuracy of the virus classifier is verified by using the test sample set. After the virus classifier passes the verification, the virus classifier is used for identifying the virus file, so that the accuracy of virus file identification is ensured.
As an embodiment, if the classification accuracy obtained in step 203 does not reach the preset accuracy threshold, it indicates that the classification accuracy of the virus classifier obtained through the training in step 202 is low, and the accuracy of virus file identification cannot be guaranteed. Therefore, in the embodiment of the application, part of the virus samples can be selected from the test sample set, and the deep learning model is continuously trained until the classification accuracy of the trained virus classifier reaches the preset accuracy threshold, so that the training of the virus classifier is completed.
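The continue-training loop described above can be outlined as below. `ToyClassifier` is a deliberately simple stand-in for the deep learning model (it memorizes two-character features), and the batch size, round limit, and removal of the selected samples from the test set are assumptions; the application does not state whether selected samples remain in the test sample set.

```python
from typing import List, Set, Tuple


def features(sample: str) -> Set[str]:
    """Two-character, non-overlapping features of a hex string."""
    return {sample[i:i + 2] for i in range(0, len(sample) - 1, 2)}


class ToyClassifier:
    """Stand-in model: flags a sample that shares a feature with training data."""

    def __init__(self) -> None:
        self.known: Set[str] = set()

    def train(self, samples: List[str]) -> None:
        for sample in samples:
            self.known |= features(sample)

    def predict(self, sample: str) -> int:
        return 1 if features(sample) & self.known else 0


def accuracy(clf: ToyClassifier, test_samples: List[str]) -> float:
    return sum(clf.predict(s) for s in test_samples) / len(test_samples)


def train_until_threshold(train_set: List[str], test_set: List[str],
                          threshold: float, batch: int = 1,
                          max_rounds: int = 10) -> Tuple[ToyClassifier, List[str], float]:
    """Train, then keep moving test samples into training until accurate enough."""
    clf = ToyClassifier()
    clf.train(train_set)
    remaining = list(test_set)
    rounds = 0
    while (len(remaining) > batch and rounds < max_rounds
           and accuracy(clf, remaining) < threshold):
        extra, remaining = remaining[:batch], remaining[batch:]
        clf.train(extra)  # continue training with part of the test set
        rounds += 1
    return clf, remaining, accuracy(clf, remaining)


clf, remaining, final_accuracy = train_until_threshold(
    ["8A345D"], ["673AB3", "67FF04", "5D0000", "3A1122"], threshold=0.9)
print(len(remaining), final_accuracy)  # 3 1.0
```

The round limit and the check on the remaining test set size are there only to guarantee that the sketch terminates.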
In addition, when a new virus that the virus classifier cannot identify is found, the new virus sample can be added and the virus classifier retrained so that it can identify the new virus. Compared with the prior art, in which virus features must be extracted manually and a virus feature library must be maintained, the embodiment of the application can update the virus classifier automatically.
The following describes a process of performing word segmentation processing on the character string represented by the file to be detected in step 101. Referring to fig. 3, a flow of implementing step 101 is shown in the embodiment of the present application.
As shown in fig. 3, the process may include the following steps:
Step 301: dividing the character string represented by the file to be detected into N word segmentation features according to the principle that a preset number of characters form one word segmentation feature and two adjacent word segmentation features in the character string do not include characters in the same position.
Here, N is a positive integer.
As an example, the preset number may be 2. Take the file "8A345D673AB3043D4A220D5F" represented as a hexadecimal string as an example. Every 2 hexadecimal characters form one word segmentation feature, and two adjacent word segmentation features do not include characters in the same position. The word segmentation features obtained after division are: 8A, 34, 5D, 67, 3A, B3, 04, 3D, 4A, 22, 0D, 5F.
As another example, the preset number may be 3. Again take the file "8A345D673AB3043D4A220D5F" represented as a hexadecimal string. Every 3 hexadecimal characters form one word segmentation feature, and two adjacent word segmentation features do not include characters in the same position. The word segmentation features obtained after division are: 8A3, 45D, 673, AB3, 043, D4A, 220, D5F.
The above two examples are merely illustrative, and the present application does not limit the preset number.
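The non-overlapping division in the two examples can be sketched as follows. The function name `segment` is hypothetical, and dropping an incomplete trailing piece when the string length is not a multiple of the preset number is an assumption; the application does not address that case.

```python
from typing import List


def segment(hex_string: str, size: int) -> List[str]:
    """Divide a string into adjacent features of `size` characters each.

    Adjacent features share no character positions. An incomplete
    trailing piece is dropped (an assumption).
    """
    return [hex_string[i:i + size]
            for i in range(0, len(hex_string) - size + 1, size)]


hex_string = "8A345D673AB3043D4A220D5F"
print(segment(hex_string, 2))
# ['8A', '34', '5D', '67', '3A', 'B3', '04', '3D', '4A', '22', '0D', '5F']
print(segment(hex_string, 3))
# ['8A3', '45D', '673', 'AB3', '043', 'D4A', '220', 'D5F']
```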
It should be noted that, under the existing word segmentation principle, two adjacent word segmentation features generally include characters in the same position. Still taking the file "8A345D673AB3043D4A220D5F" represented as a hexadecimal string as an example, dividing every 2 hexadecimal characters into one word segmentation feature yields the following word segmentation features: 8A, A3, 34, 45, 5D, D6, 67, 73, 3A, AB, B3, 30, 04, 43, 3D, D4, 4A, A2, 22, 20, 0D, D5, 5F.
Using the method provided by the present application, the inventor compared the file identification accuracy obtained with word segmentation features divided according to the existing word segmentation principle against that obtained with features divided according to the principle of this embodiment. The comparison results are shown in Table 1.
[Figure BDA0002275793600000101: Table 1, accuracy comparison (image not reproduced)]
TABLE 1
It can be seen that, for the same file content and the same number of characters per word segmentation feature, the division used in the present application yields fewer word segmentation features with almost no effect on identification accuracy, which reduces computational complexity and improves word segmentation efficiency.
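The difference in the number of features between the two division principles can be checked with a short sketch. Both function names are hypothetical; the overlapping variant models the existing principle as a window that advances one character at a time, which matches the feature list given in the example above.

```python
from typing import List


def segment_overlapping(s: str, size: int) -> List[str]:
    """Existing principle: the window advances one character at a time."""
    return [s[i:i + size] for i in range(len(s) - size + 1)]


def segment_non_overlapping(s: str, size: int) -> List[str]:
    """Principle of this embodiment: the window advances `size` characters."""
    return [s[i:i + size] for i in range(0, len(s) - size + 1, size)]


hex_string = "8A345D673AB3043D4A220D5F"
overlapping = segment_overlapping(hex_string, 2)
non_overlapping = segment_non_overlapping(hex_string, 2)
print(len(overlapping), len(non_overlapping))  # 23 12
```

For a string of length L and a preset number n, the overlapping division produces L - n + 1 features, while the non-overlapping division produces L // n, so the number of features falls by roughly a factor of n.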
Step 302: constructing a feature matrix of the file to be detected based on the N word segmentation features.
In the embodiment of the application, the structure and the operation complexity of the deep learning model can be comprehensively considered, and the size of the feature matrix is preset, for example, the size of the feature matrix is preset to be M rows × K columns.
If the number N of word segmentation features obtained in step 301 is greater than M × K, M × K word segmentation features may be selected from the N features, for example the first M × K features, to form a feature matrix of M rows × K columns.
Taking the 12 word segmentation features 8A, 34, 5D, 67, 3A, B3, 04, 3D, 4A, 22, 0D, and 5F obtained in step 301 as an example, with a preset feature matrix size of 3 × 3, the first 9 word segmentation features are selected to form the 3 × 3 feature matrix shown in Table 2.
8A 34 5D
67 3A B3
04 3D 4A
TABLE 2
If the number N of word segmentation features obtained in step 301 is smaller than M × K, zeros may be padded to form a feature matrix of M rows × K columns.
Taking the 8 word segmentation features 8A3, 45D, 673, AB3, 043, D4A, 220, and D5F obtained in step 301 as an example, with a preset feature matrix size of 3 × 3, the number of word segmentation features is smaller than the size of the feature matrix, so one word segmentation feature 0 is padded to form the 3 × 3 feature matrix shown in Table 3.
8A3 45D 673
AB3 043 D4A
220 D5F 0
TABLE 3
The flow shown in fig. 3 is completed.
The process shown in fig. 3 is used to implement word segmentation processing, and obtain a feature matrix corresponding to the document to be detected.
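The truncation and zero-padding rules of step 302 can be sketched as follows. The function name `build_feature_matrix` and the representation of the matrix as a list of rows are assumptions for illustration.

```python
from typing import List, Sequence


def build_feature_matrix(features: Sequence[str], rows: int, cols: int,
                         pad: str = "0") -> List[List[str]]:
    """Arrange features row by row into a rows x cols matrix.

    Extra features beyond rows * cols are discarded (the first ones are
    kept); missing positions are filled with the padding feature "0".
    """
    needed = rows * cols
    cells = list(features[:needed])
    cells += [pad] * (needed - len(cells))
    return [cells[r * cols:(r + 1) * cols] for r in range(rows)]


two_char = ["8A", "34", "5D", "67", "3A", "B3", "04", "3D", "4A", "22", "0D", "5F"]
three_char = ["8A3", "45D", "673", "AB3", "043", "D4A", "220", "D5F"]

print(build_feature_matrix(two_char, 3, 3))    # corresponds to Table 2
print(build_feature_matrix(three_char, 3, 3))  # corresponds to Table 3
```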
Next, the process in step 102 of converting the value of each element in the feature matrix into a gray value, to obtain the grayscale image to be detected corresponding to the file to be detected, is described.
As can be seen from the foregoing description, the elements in the feature matrix are word segmentation features. Depending on the word segmentation principle used, the value ranges of the resulting word segmentation features (elements) differ. For example, when every 2 hexadecimal characters form one word segmentation feature, the decimal value range of each feature is 0-255; when every 3 hexadecimal characters form one word segmentation feature, the decimal value range is 0-4095.
Therefore, in the embodiment of the application, the value of each element (word segmentation feature) is normalized to the gray value range (0-255) to obtain the gray value of the corresponding pixel point. In one example, the normalization formula is:
G = F / H × D
where G is the gray value; F is the decimal number corresponding to the value of the element; H is the maximum value of the decimal value range of the element plus 1; and D is the maximum value of the gray value range plus 1.
Taking the feature matrix shown in Table 2 as an example, each element consists of 2 hexadecimal characters, so the corresponding decimal value range is 0-255, which is the same as the gray value range. The value of each element can therefore be converted directly into the gray value of the corresponding pixel point. The converted values are shown in Table 4.
138 52 93
103 58 179
4 61 74
TABLE 4
Taking the feature matrix shown in Table 3 as an example, each element consists of 3 hexadecimal characters, so the corresponding decimal value range is 0-4095, whereas the gray value range is 0-255. The value of each element therefore needs to be normalized into a gray value within the range 0-255.
Taking the first element 8A3 in Table 3 as an example, the corresponding decimal number is 2211, so the gray value of the pixel point corresponding to this element is 2211/4096 × 256 ≈ 138. By analogy, the gray value of the pixel point corresponding to each element is obtained, as shown in Table 5.
138 70 103
171 4 213
34 214 0
TABLE 5
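The normalization G = F / H × D can be checked against Tables 4 and 5 with a short Python sketch. The function name `to_gray` is hypothetical. Rounding to the nearest integer is inferred from the values in Table 5 rather than stated in the text, and clamping the result to 255 is an added assumption so that an element such as FFF does not round up to 256.

```python
from typing import List


def to_gray(element: str, chars_per_feature: int, gray_levels: int = 256) -> int:
    """Normalize one hexadecimal word segmentation feature to a gray value."""
    f = int(element, 16)                 # F: decimal value of the element
    h = 16 ** chars_per_feature          # H: maximum element value + 1
    g = round(f / h * gray_levels)       # G = F / H * D (rounding is an assumption)
    return min(g, gray_levels - 1)       # clamp to 0-255 (assumption)


def matrix_to_gray(matrix: List[List[str]], chars_per_feature: int) -> List[List[int]]:
    """Convert every element of a feature matrix into a gray value."""
    return [[to_gray(e, chars_per_feature) for e in row] for row in matrix]


gray_table4 = matrix_to_gray(
    [["8A", "34", "5D"], ["67", "3A", "B3"], ["04", "3D", "4A"]], 2)
gray_table5 = matrix_to_gray(
    [["8A3", "45D", "673"], ["AB3", "043", "D4A"], ["220", "D5F", "0"]], 3)
```

With two-character features H equals D, so the gray value is simply the decimal value of the element, which is why Table 4 needs no scaling.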
At this point, the file to be detected has been converted into the corresponding grayscale image to be detected.
The method provided by the embodiment of the present application has been described above. The apparatus provided by the embodiment of the present application is described below.
Fig. 4 is a schematic structural diagram of an apparatus provided in an embodiment of the present application. The apparatus includes a word segmentation unit 401, a conversion unit 402, an input unit 403, and a first determination unit 404, wherein:
the word segmentation unit 401 is configured to perform word segmentation processing on the character string represented by a file to be detected, to obtain the word segmentation features of the file to be detected and the feature matrix of the file to be detected;
the conversion unit 402 is configured to convert, for each element in the feature matrix, the value of the element into a gray value, to obtain the grayscale image to be detected corresponding to the file to be detected;
the input unit 403 is configured to input the grayscale image to be detected into a trained virus classifier;
the first determination unit 404 is configured to determine whether the file to be detected is a virus file according to the classification result output by the virus classifier.
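How the four units chain together can be illustrated with a minimal Python sketch. Everything named here is hypothetical: `file_to_gray_image`, `is_virus_file`, the `"virus"` label, and especially `placeholder_classifier`, which is a stand-in callable and not a trained virus classifier. Dropping a trailing partial chunk, rounding, and clamping to 255 are assumptions as well.

```python
from typing import Callable, List

GrayImage = List[List[int]]


def file_to_gray_image(hex_string: str, width: int, rows: int, cols: int) -> GrayImage:
    """Word segmentation unit plus conversion unit in one step."""
    # Non-overlapping chunks of `width` hex characters; a trailing partial
    # chunk is dropped (assumption).
    usable = len(hex_string) - len(hex_string) % width
    features = [hex_string[i:i + width] for i in range(0, usable, width)]
    # Feature matrix: keep the first rows*cols features, pad with "0" if short.
    size = rows * cols
    features = features[:size]
    features += ["0"] * (size - len(features))
    # G = F / H * D with D = 256, rounded and clamped to 255 (assumptions).
    h = 16 ** width
    grays = [min(round(int(f, 16) / h * 256), 255) for f in features]
    return [grays[r * cols:(r + 1) * cols] for r in range(rows)]


def is_virus_file(hex_string: str, classifier: Callable[[GrayImage], str],
                  width: int = 2, rows: int = 3, cols: int = 3) -> bool:
    """Input unit plus first determination unit."""
    image = file_to_gray_image(hex_string, width, rows, cols)
    return classifier(image) == "virus"   # label name is an assumption


def placeholder_classifier(image: GrayImage) -> str:
    """Stand-in only: thresholds mean brightness, NOT a trained model."""
    pixels = [p for row in image for p in row]
    return "virus" if sum(pixels) / len(pixels) > 128 else "benign"


example = "8A345D673AB3043D4A220D5F"   # the example features, concatenated
image = file_to_gray_image(example, 2, 3, 3)
verdict = is_virus_file(example, placeholder_classifier)
```

In a real deployment the `classifier` argument would be the trained virus classifier; passing it as a callable keeps the preprocessing independent of the model.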
As an embodiment, the apparatus further comprises:
a dividing unit, configured to divide a virus sample set into a training sample set and a test sample set, wherein the virus sample set comprises a plurality of known virus samples;
a training unit, configured to train a deep learning model by using the virus samples in the training sample set to obtain a virus classifier;
a verification unit, configured to verify the classification accuracy of the trained virus classifier by using the virus samples in the test sample set;
and a second determination unit, configured to determine that training of the virus classifier is complete if the classification accuracy reaches a preset accuracy threshold.
As an embodiment, the training unit is further configured to, if the classification accuracy does not reach a preset accuracy threshold, select a part of the virus samples from the test sample set to continue training the deep learning model until the classification accuracy of the trained virus classifier reaches the preset accuracy threshold.
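The divide, train, verify, and continue-training steps can be sketched as a loop. This is a toy illustration under several assumptions that the text does not specify: the 80/20 split, the rotating choice of which test samples are fed back, the `max_rounds` cap, and the label names. `MemorizingModel` is a hypothetical stand-in that merely memorizes samples; it is not a deep learning model.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Image = List[List[int]]
Sample = Tuple[Image, str]          # (grayscale image, known label)


def split_samples(samples: Sequence[Sample], train_fraction: float = 0.8,
                  seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Dividing unit: shuffle and split into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predict: Callable[[Image], str], test_set: Sequence[Sample]) -> float:
    """Verification unit: fraction of test samples classified correctly."""
    return sum(predict(img) == label for img, label in test_set) / len(test_set)


def train_until_accurate(samples: Sequence[Sample],
                         fit: Callable[[Sequence[Sample]], None],
                         predict: Callable[[Image], str],
                         threshold: float, batch: int = 1,
                         max_rounds: int = 10) -> bool:
    """Train, verify, and keep training on part of the test set if needed."""
    train_set, test_set = split_samples(samples)
    if not test_set:
        raise ValueError("test sample set is empty")
    fit(train_set)
    for round_index in range(max_rounds):
        if accuracy(predict, test_set) >= threshold:
            return True                       # second determination unit
        start = (round_index * batch) % len(test_set)
        fit(test_set[start:start + batch])    # continue with part of the test set
    return accuracy(predict, test_set) >= threshold


class MemorizingModel:
    """Hypothetical stand-in for the deep learning model."""

    def __init__(self) -> None:
        self.known: Dict[tuple, str] = {}

    def fit(self, batch: Sequence[Sample]) -> None:
        for image, label in batch:
            self.known[tuple(map(tuple, image))] = label

    def predict(self, image: Image) -> str:
        return self.known.get(tuple(map(tuple, image)), "unknown")


samples = [([[i, i], [i, i]], "family_%d" % (i % 2)) for i in range(10)]
model = MemorizingModel()
trained = train_until_accurate(samples, model.fit, model.predict, threshold=0.9)
```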
As an embodiment, the word segmentation unit 401 performing word segmentation processing on the character string represented by the file to be detected to obtain the word segmentation features and the feature matrix of the file to be detected includes:
dividing the character string represented by the file to be detected into N word segmentation features, where each word segmentation feature consists of a preset number of characters and two adjacent word segmentation features do not include a character at the same position in the character string (that is, they do not overlap), and N is a positive integer;
and constructing the feature matrix of the file to be detected based on the N word segmentation features.
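Non-overlapping segmentation amounts to cutting the character string into consecutive fixed-width chunks. The sketch below uses a hypothetical function name `segment`; dropping a trailing chunk that is shorter than the preset width is an assumption, since the text here does not say how leftover characters are handled. The input string is the concatenation of the example features used for Tables 2 and 3.

```python
from typing import List


def segment(hex_string: str, width: int) -> List[str]:
    """Split a string into consecutive, non-overlapping chunks of `width` characters."""
    if width <= 0:
        raise ValueError("width must be a positive integer")
    # Ignore a trailing partial chunk (assumption, not stated in the text).
    usable = len(hex_string) - len(hex_string) % width
    return [hex_string[i:i + width] for i in range(0, usable, width)]


example = "8A345D673AB3043D4A220D5F"
two_char_features = segment(example, 2)     # 12 features
three_char_features = segment(example, 3)   # 8 features
```

Because each chunk starts where the previous one ends, no character position appears in two adjacent features.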
As an embodiment, the conversion unit 402 converting the value of the element into a gray value includes:
normalizing the value of the element based on the gray value range to obtain the corresponding gray value.
This completes the description of the apparatus shown in Fig. 4. In the embodiment of the application, file identification is converted into image identification, and the high recognition accuracy of the classifier makes it possible to identify virus files that have been slightly modified or upgraded, thereby reducing the probability that a virus file goes undetected.
The following describes the device provided in the embodiment of the present invention:
Fig. 5 is a schematic diagram of the hardware structure of a device according to an embodiment of the present invention. The device may include a processor 501 and a machine-readable storage medium 502 storing machine-executable instructions. The processor 501 and the machine-readable storage medium 502 may communicate via a system bus 503. The processor 501 may perform the virus file detection method described above by reading and executing the machine-executable instructions in the machine-readable storage medium 502 that correspond to the virus file detection logic.
The machine-readable storage medium 502 referred to herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions and data. For example, the machine-readable storage medium 502 may include at least one of the following storage media: volatile memory, non-volatile memory, or other types of storage media. The volatile memory may be a Random Access Memory (RAM), and the non-volatile memory may be a flash memory, a storage drive (e.g., a hard disk drive), a solid state disk, or a storage disk (e.g., a compact disk or a DVD).
Embodiments of the present invention also provide a machine-readable storage medium, such as machine-readable storage medium 502 in fig. 5, comprising machine-executable instructions that are executable by processor 501 in a device to implement the virus file detection method described above.
This completes the description of the device shown in Fig. 5.
The above description is only a preferred embodiment of the present application and should not be taken as limiting the present application; any modifications, equivalent replacements, improvements, etc. made within the spirit and principle of the present application shall fall within the scope of protection of the present application.

Claims (12)

1. A method for virus file detection, the method comprising:
performing word segmentation processing on a character string represented by a file to be detected to obtain word segmentation features of the file to be detected and a feature matrix of the file to be detected;
for each element in the feature matrix, converting the value of the element into a gray value to obtain a grayscale image to be detected corresponding to the file to be detected;
inputting the grayscale image to be detected into a trained virus classifier;
and determining whether the file to be detected is a virus file or not according to the classification result output by the virus classifier.
2. The method of claim 1, wherein before inputting the grayscale image to be detected into the trained virus classifier, the method further comprises:
dividing a virus sample set into a training sample set and a testing sample set, wherein the virus sample set comprises a plurality of known virus samples;
training the deep learning model by using the virus samples in the training sample set to obtain a virus classifier;
verifying the classification accuracy of the virus classifier by using the virus samples in the test sample set for the virus classifier obtained by training;
and if the classification accuracy reaches a preset accuracy threshold, determining that the virus classifier is trained.
3. The method of claim 2, wherein the method further comprises:
and if the classification accuracy rate does not reach the preset accuracy rate threshold value, selecting a part of virus samples from the test sample set to continue training the deep learning model until the classification accuracy rate of the trained virus classifier reaches the preset accuracy rate threshold value.
4. The method of claim 1, wherein the performing word segmentation processing on the character string represented by the file to be detected to obtain word segmentation features of the file to be detected and a feature matrix of the file to be detected comprises:
dividing the character string represented by the file to be detected into N word segmentation features, where each word segmentation feature consists of a preset number of characters and two adjacent word segmentation features do not include a character at the same position in the character string, and N is a positive integer;
and constructing a feature matrix of the file to be detected based on the N word segmentation features.
5. The method of claim 1, wherein said converting the value of the element to a grayscale value comprises:
and based on the gray value range, carrying out normalization processing on the values of the elements to obtain corresponding gray values.
6. A virus file detection apparatus, comprising:
a word segmentation unit, configured to perform word segmentation processing on the character string represented by a file to be detected to obtain word segmentation features of the file to be detected and a feature matrix of the file to be detected;
a conversion unit, configured to convert, for each element in the feature matrix, the value of the element into a gray value to obtain a grayscale image to be detected corresponding to the file to be detected;
an input unit, configured to input the grayscale image to be detected into a trained virus classifier;
and a first determination unit, configured to determine whether the file to be detected is a virus file according to the classification result output by the virus classifier.
7. The apparatus of claim 6, wherein the apparatus further comprises:
a dividing unit, configured to divide a virus sample set into a training sample set and a test sample set, wherein the virus sample set comprises a plurality of known virus samples;
a training unit, configured to train a deep learning model by using the virus samples in the training sample set to obtain a virus classifier;
a verification unit, configured to verify the classification accuracy of the trained virus classifier by using the virus samples in the test sample set;
and a second determination unit, configured to determine that training of the virus classifier is complete if the classification accuracy reaches a preset accuracy threshold.
8. The apparatus of claim 7, wherein:
the training unit is further configured to, if the classification accuracy does not reach the preset accuracy threshold, select some of the virus samples from the test sample set to continue training the deep learning model until the classification accuracy of the trained virus classifier reaches the preset accuracy threshold.
9. The apparatus according to claim 6, wherein the word segmentation unit performing word segmentation processing on the character string represented by the file to be detected to obtain word segmentation features of the file to be detected and a feature matrix of the file to be detected comprises:
dividing the character string represented by the file to be detected into N word segmentation features, where each word segmentation feature consists of a preset number of characters and two adjacent word segmentation features do not include a character at the same position in the character string, and N is a positive integer;
and constructing a feature matrix of the file to be detected based on the N word segmentation features.
10. The apparatus of claim 6, wherein the conversion unit converts the values of the elements into grayscale values, comprising:
and based on the gray value range, carrying out normalization processing on the values of the elements to obtain corresponding gray values.
11. A device comprising a processor and a machine-readable storage medium storing machine-executable instructions executable by the processor, the processor being caused by the machine-executable instructions to carry out the method steps of any one of claims 1 to 5.
12. A machine-readable storage medium having stored therein machine-executable instructions which, when executed by a processor, perform the method steps of any of claims 1-5.
CN201911122399.XA 2019-11-15 2019-11-15 Virus file detection method, device and equipment Pending CN110879888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911122399.XA CN110879888A (en) 2019-11-15 2019-11-15 Virus file detection method, device and equipment

Publications (1)

Publication Number Publication Date
CN110879888A true CN110879888A (en) 2020-03-13

Family

ID=69729112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911122399.XA Pending CN110879888A (en) 2019-11-15 2019-11-15 Virus file detection method, device and equipment

Country Status (1)

Country Link
CN (1) CN110879888A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051569A (en) * 2021-03-31 2021-06-29 深信服科技股份有限公司 Virus detection method and device, electronic equipment and storage medium
CN113553586A (en) * 2021-06-16 2021-10-26 深信服科技股份有限公司 Virus detection method, model training method, device, equipment and storage medium
CN114332700A (en) * 2021-12-24 2022-04-12 中国电子信息产业集团有限公司第六研究所 Network virus classification method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804919A (en) * 2018-05-03 2018-11-13 上海交通大学 The homologous determination method of malicious code based on deep learning
CN109165688A (en) * 2018-08-28 2019-01-08 暨南大学 A kind of Android Malware family classification device construction method and its classification method
CN110096878A (en) * 2019-04-26 2019-08-06 武汉智美互联科技有限公司 A kind of detection method of Malware

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113051569A (en) * 2021-03-31 2021-06-29 深信服科技股份有限公司 Virus detection method and device, electronic equipment and storage medium
CN113051569B (en) * 2021-03-31 2024-05-28 深信服科技股份有限公司 Virus detection method and device, electronic equipment and storage medium
CN113553586A (en) * 2021-06-16 2021-10-26 深信服科技股份有限公司 Virus detection method, model training method, device, equipment and storage medium
CN113553586B (en) * 2021-06-16 2024-05-28 深信服科技股份有限公司 Virus detection method, model training method, device, equipment and storage medium
CN114332700A (en) * 2021-12-24 2022-04-12 中国电子信息产业集团有限公司第六研究所 Network virus classification method and device, electronic equipment and storage medium
CN114332700B (en) * 2021-12-24 2023-08-25 中国电子信息产业集团有限公司第六研究所 Network virus classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200313