CN110619212B - Character string-based malicious software identification method, system and related device - Google Patents

Info

Publication number
CN110619212B
Authority
CN
China
Prior art keywords
character string
importance
file
software
character strings
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810639502.7A
Other languages
Chinese (zh)
Other versions
CN110619212A (en
Inventor
章明星
位凯志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd
Priority to CN201810639502.7A (patent CN110619212B)
Priority to PCT/CN2019/087563 (patent WO2019242443A1)
Publication of CN110619212A
Application granted
Publication of CN110619212B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562: Static detection
    • G06F21/563: Static detection by source code analysis

Abstract

The application discloses a character string-based malware identification method. The method uses the TFIDF technology to compute the importance degree of each character string extracted from the PE files contained in the software to be tested, selects from these the high-importance character strings that have high classification capability and discriminative power, and then identifies malware by using the high-importance character strings as malware identification features. Because the method does not rely on manual identification based on personal experience, it runs automatically and more efficiently, identifies malicious content more accurately, and has a lower probability of omission. The application also discloses a character string-based malware identification system, a device and a computer-readable storage medium, which have the same beneficial effects.

Description

Character string-based malicious software identification method, system and related device
Technical Field
The present application relates to the field of malware identification, and in particular, to a method, a system, an apparatus, and a computer-readable storage medium for identifying malware based on a character string.
Background
With the continuous development of computer programming technology, software written in various programming languages allows people to complete many kinds of tasks on a computer more conveniently. Malware carrying malicious content has appeared alongside it, attacking normal data files or stealing the results of other people's work. It is therefore important to identify whether the software to be tested is malware.
In existing malware identification methods, a person skilled in the art usually inspects the huge amount of code or character strings contained in the software to be tested manually, relying on personal experience. Some malware is built on top of normal software, so it contains a large amount of useless code or character strings, and a very small amount of malicious content is submerged in that mass of useless material. Some malware also disguises its malicious content to evade identification. The traditional manual method is therefore extremely inefficient and unstable, and an experience-based approach cannot accurately identify malicious content that is new or hard to find, so its practical effect is poor.
Therefore, how to overcome the technical defects of existing manual malware identification, and to provide a malware identification method that requires no manual inspection, is more efficient and identifies malware more accurately, is a problem to be solved by those skilled in the art.
Disclosure of Invention
An object of the present application is to provide a character string-based malware identification method that uses the TFIDF technology to compute the importance degree of each character string extracted from the PE files contained in the software to be tested, selects from these the high-importance character strings that have high classification capability and discriminative power, and identifies malware by using the high-importance character strings as malware identification features.
Another object of the present application is to provide a system, an apparatus, and a computer-readable storage medium for character string-based malware recognition.
In order to achieve the above object, the present application provides a method for identifying malware based on a character string, the method including:
calculating the importance degree of each character string extracted from the PE file contained in the software to be tested on each PE file by using a TFIDF technology to obtain an importance evaluation parameter;
screening the character strings by importance degree according to the importance evaluation parameters to obtain a first preset number of high-importance character strings, and taking each high-importance character string as a malware identification feature; wherein the magnitude of the importance evaluation parameter is proportional to the importance degree of the character string;
and identifying the malicious software of the software to be tested by utilizing the identification characteristics of the malicious software.
Optionally, calculating, by using the TFIDF technology, the importance degree of each character string extracted from the PE files contained in the software to be tested with respect to each PE file to obtain the importance evaluation parameter includes:
calculating, by using the formula
Figure BDA0001701976340000021
the importance degree of each character string extracted from the PE files contained in the software to be tested with respect to each PE file, to obtain the importance evaluation parameter;
wherein TF is the occurrence frequency of the character string in each PE file, IDF is the number of PE files in the software to be tested that contain the character string, and N is the total number of PE files contained in the software to be tested.
Optionally, screening the character strings by importance degree according to the importance evaluation parameters to obtain a first preset number of high-importance character strings includes:
arranging all the importance evaluation parameters in descending order to obtain an importance evaluation parameter queue;
and selecting the first preset number of importance evaluation parameters from the head of the queue, and taking the corresponding character strings as the high-importance character strings.
Optionally, the method further includes:
merging the PE files contained in the software to be tested by category, according to the different categories of malware, to obtain a merged PE file set for each category;
calculating, by using the TFIDF technology, the importance degree of each character string with respect to each PE file within the merged PE file set of each category, to obtain merged importance parameters;
selecting, in descending order of the merged importance parameters, a second preset number of high-importance character strings, and using them as malware identification features; wherein the second preset number is less than the first preset number.
Optionally, before calculating the importance degree of each character string extracted from the PE file included in the software to be tested to each PE file by using the TFIDF technology, the method further includes:
dividing each PE file into a header metadata section, a code section and a data section according to the file format of the PE file;
extracting the content of the header metadata section by using a metadata parser to obtain a first character string set;
decompiling the content of the code section, and extracting character strings containing preset function names from the decompiled data to obtain a second character string set;
extracting character strings including at least one of an IP address, a mailbox address and a website address from the content of the data section to obtain a third character string set;
and combining the first character string set, the second character string set and the third character string set extracted from each PE file into a corresponding PE document according to a preset combination format, so as to reduce the number of invalid character strings.
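The data-section step and the final combination step can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the three regular expressions, the function names `extract_indicators` and `build_pe_document`, and the space-joined "combination format" are all assumptions introduced here.

```python
import re

# Hypothetical patterns; a production extractor would likely be stricter.
IP_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
URL_RE = re.compile(r"https?://[\x21-\x7e]+")  # stops at the first non-printable byte


def extract_indicators(data: bytes) -> list[str]:
    """Return the IP addresses, mailbox addresses and URLs found in a data section."""
    text = data.decode("latin-1")  # latin-1 maps every byte to one character
    found: set[str] = set()
    for pattern in (IP_RE, EMAIL_RE, URL_RE):
        found.update(pattern.findall(text))
    return sorted(found)


def build_pe_document(header_strings, code_strings, data_strings) -> str:
    """Join the three string sets into one 'PE document' (assumed format: space-joined)."""
    return " ".join([*header_strings, *code_strings, *data_strings])


section = b"\x00\x01connect 10.0.0.5\x00mail admin@example.com\x00http://example.com/a.php\x00"
indicators = extract_indicators(section)
document = build_pe_document(["machine=0x014c"], ["CreateFileA"], indicators)
```

Keeping only these targeted strings, rather than every printable run in the file, is what reduces the number of invalid character strings passed to the TFIDF step.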
Optionally, before the malware identification of the software to be tested is performed by using each malware identification feature, the method further includes:
classifying the malware identification features by using a topic analysis technology to obtain topic-classified identification features;
and converting each PE file into a topic distribution vector according to the topic-classified identification features contained in that PE file.
Optionally, the identifying malware of the software to be tested by using each malware identification feature includes:
and classifying each topic distribution vector by utilizing a linear classification algorithm.
In order to achieve the above object, the present application further provides a system for identifying malware based on character strings, including:
the TFIDF calculation unit is used for calculating the importance degree of each character string extracted from the PE file contained in the software to be tested on each PE file by utilizing a TFIDF technology to obtain an importance evaluation parameter;
the malware identification feature selection unit is used for screening the character strings by importance degree according to the importance evaluation parameters to obtain a first preset number of high-importance character strings, and taking each high-importance character string as a malware identification feature; wherein the magnitude of the importance evaluation parameter is proportional to the importance degree of the character string;
and the malicious software identification unit is used for identifying the malicious software of the software to be tested by utilizing the malicious software identification characteristics.
Optionally, the TFIDF calculating unit includes:
a formula calculation subunit for calculating, by using the formula
Figure BDA0001701976340000041
the importance degree of each character string extracted from the PE files contained in the software to be tested with respect to each PE file, to obtain the importance evaluation parameter; wherein TF is the occurrence frequency of the character string in each PE file, IDF is the number of PE files in the software to be tested that contain the character string, and N is the total number of PE files contained in the software to be tested.
Optionally, the system further comprises:
the sequence arrangement unit is used for arranging all the importance evaluation parameters from top to bottom according to a descending order to obtain an importance evaluation parameter arrangement queue;
and the first identification feature selection subunit is used for selecting the importance evaluation parameters of the first preset number from the importance evaluation parameter arrangement queue from top to bottom, and taking the corresponding character strings as the high-importance character strings.
Optionally, the system further comprises:
the category merging subunit is used for merging the PE files contained in the software to be tested by category, according to the different categories of malware, to obtain a merged PE file set for each category;
the merged parameter calculation subunit is used for calculating, by using the TFIDF technology, the importance degree of each character string with respect to each PE file within the merged PE file set of each category, to obtain merged importance parameters;
a second identification feature selection subunit, configured to select, in descending order of the merged importance parameters, a second preset number of high-importance character strings, and use them as malware identification features; wherein the second preset number is less than the first preset number.
Optionally, the system further comprises:
a PE file content splitting unit for splitting each PE file into a header metadata segment, a code segment, and a data segment according to a specific file format constituting the PE file;
the header content extraction unit is used for extracting the content of the header metadata section by using a metadata parser to obtain a first character string set;
the code segment content extraction unit is used for decompiling the content in the code segment, and extracting a character string containing a preset function name from decompiled data to obtain a second character string set;
a data segment content extraction unit, configured to extract a character string including at least one of an IP address, a mailbox address, and a web address from content included in the data segment, to obtain a third character string set;
and the PE document generating unit is used for generating corresponding PE documents by the first character string set, the second character string set and the third character string set extracted from each PE file according to a preset combination format so as to reduce the number of invalid character strings.
Optionally, the system further comprises:
the topic analysis and classification unit is used for classifying the malware identification features by using a topic analysis technology to obtain topic-classified identification features;
and the topic distribution vector generating unit is used for converting each PE file into a topic distribution vector according to the topic-classified identification features contained in that PE file.
Optionally, the malware identification unit includes:
and the linear classification algorithm classification subunit is used for classifying each topic distribution vector by using a linear classification algorithm.
In order to achieve the above object, the present application further provides a device for identifying malware based on a character string, the device including:
a memory for storing a computer program;
a processor for implementing the steps of the malware identification method as described above when executing the computer program.
To achieve the above object, the present application also provides a computer-readable storage medium having a computer program stored thereon, which, when being executed by a processor, implements the steps of the malware identification method as described above.
The application provides a character string-based malware identification method, which comprises the following steps: calculating the importance degree of each character string extracted from the PE files contained in the software to be tested with respect to each PE file by using a TFIDF technology to obtain an importance evaluation parameter; screening the character strings by importance degree according to the importance evaluation parameters to obtain a first preset number of high-importance character strings, and taking each high-importance character string as a malware identification feature, wherein the magnitude of the importance evaluation parameter is proportional to the importance degree of the character string; and identifying whether the software to be tested is malware by using the malware identification features.
In the technical scheme provided by the application, the TFIDF technology is used to compute the importance degree of each character string extracted from the PE files contained in the software to be tested, the high-importance character strings with high classification capability and discriminative power are selected from these, and the high-importance character strings are then used as malware identification features to complete malware identification. The application also provides a character string-based malware identification system, a device and a computer-readable storage medium, which have the same beneficial effects; these are not repeated here.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a method for identifying malware based on character strings according to an embodiment of the present application;
Fig. 2 is a flowchart of extracting character strings from a PE file in the method for identifying malware based on character strings according to an embodiment of the present application;
Fig. 3 is a flowchart of another method for identifying malware based on character strings according to an embodiment of the present application;
Fig. 4 is a block diagram of the structure of a malware identification system based on character strings according to an embodiment of the present application.
Detailed Description
The core of the application is to provide a method, a system, a device and a computer readable storage medium for identifying malicious software based on character strings, wherein the TFIDF technology is utilized to carry out importance degree statistics on each character string extracted from a PE file contained in software to be detected, a high importance character string with high classification capability and identification degree is selected from the character strings, and then the high importance character string is used as a malicious software identification characteristic to complete the identification of the malicious software.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Example one
Referring to Fig. 1, Fig. 1 is a flowchart of a method for identifying malware based on character strings according to an embodiment of the present application.
The method specifically comprises the following steps:
S101: calculating the importance degree of each character string extracted from the PE files contained in the software to be tested with respect to each PE file by using a TFIDF technology to obtain an importance evaluation parameter;
TFIDF (term frequency-inverse document frequency) is a statistical method commonly used in the field of natural language processing to evaluate how important a word is to one document in a document set or corpus: the importance of the word increases in proportion to the number of times it appears in the document, but decreases as the frequency with which it appears across the corpus increases. TF stands for Term Frequency, and IDF stands for Inverse Document Frequency.
The main idea of the TFIDF technique is that if a word or phrase appears frequently in one article (high TF) but rarely in other articles, it is considered to have good classification capability and to be suitable for classification. For a given document, TF is the frequency with which a given term appears in that document. It is the occurrence count normalized by document length, which prevents a bias towards long documents (the same term may occur more often in a long document than in a short one regardless of its importance).
For a term $t_i$ in a particular document $d_j$, its TF may be expressed as:
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$
where the numerator $n_{i,j}$ is the number of occurrences of the term in the particular document, and the denominator is the sum of the numbers of occurrences of all terms in that document.
The IDF of the term may be obtained by dividing the total number of files in the file set by the number of files containing the term, and then taking the logarithm of the quotient:
$$\mathrm{idf}_{i} = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $|D|$ in the numerator is the total number of files in the file set, and $|\{j : t_i \in d_j\}|$ in the denominator is the number of files containing the term. Because the denominator would be zero if the term does not appear in the file set, $1 + |\{j : t_i \in d_j\}|$ is typically used as the denominator.
Finally, the product of TF and IDF is calculated. A high term frequency within a particular document, combined with a low document frequency of the term across the whole document set, yields a TFIDF result with a high weight. The TFIDF technique therefore tends to filter out common words and to keep the important words that have classification capability.
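As a worked illustration of the TF, IDF and product described above, the following Python sketch scores the strings of three toy "PE documents". The function name `tf_idf` and the sample strings are assumptions made for the example. It uses the unsmoothed denominator, which is safe here because every scored term occurs in at least one document.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return, for each document, a mapping from term to its TF * IDF score."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq: Counter = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())  # sum of occurrences of all terms in the document
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    ["kernel32.dll", "LoadLibraryA", "evil.example"],
    ["kernel32.dll", "LoadLibraryA"],
    ["kernel32.dll", "GetProcAddress"],
]
scores = tf_idf(docs)
```

In this toy set, `kernel32.dll` occurs in every document and scores zero, while `evil.example` occurs in only one document and receives the highest score in it.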
The application applies the TFIDF technology, commonly used in the field of natural language processing, to the field of malware identification. The aim is to assess the importance of the code or character strings in each PE file contained in the software to be tested, so that high-importance character strings usable as malware identification features can be found from the result. (PE stands for Portable Executable; PE files are program files on the Microsoft Windows operating system, and the common EXE, DLL, OCX, SYS and COM files are PE files.)
The TFIDF technology is adopted because it can find strongly discriminative character strings among all the PE files contained in the software to be tested. On the one hand, programming languages impose a fixed format, so character strings that occur many times are likely to be meaningless strings required by that format (useless for identifying malware), and the technique screens them out well. On the other hand, existing malware is usually made by embedding malicious content in normal software; the malicious content has a small data volume and its characteristic character strings occur only a few times, yet it has strong attack capability and thereby evades detection.
In view of these phenomena, the present application applies the TFIDF technology to malware identification in order to find the character strings with strong classification capability in the software to be tested. Because the technology is being applied in a different field, some corresponding changes are required; these are described in detail in the subsequent steps.
Further, the character strings used in this step may be extracted in the simplest manner, that is, by converting all the content of the PE file into character strings. However, this yields a large number of meaningless character strings (useless for malware identification). Each part of the PE file may therefore instead be extracted in a targeted manner according to the file format of the PE file, so as to reduce the number of meaningless character strings obtained and the amount of computation subsequently required by the TFIDF technique.
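The "simplest manner" mentioned above can be sketched as a scan for runs of printable ASCII bytes, similar in spirit to the Unix `strings` utility. This is an illustrative assumption about what converting all content into character strings could look like; the function name `extract_strings` and the minimum length of 4 are hypothetical choices.

```python
import re


def extract_strings(blob: bytes, min_len: int = 4) -> list[str]:
    """Return every run of at least min_len printable ASCII bytes found in blob."""
    # 0x20-0x7e covers the printable ASCII range, including the space character.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(blob)]


sample = b"MZ\x90\x00\x03hello.dll\x00\x00ab\x00CreateFileA\x00"
strings = extract_strings(sample)
```

Such a scan keeps everything that happens to look like text, which is why the targeted, format-aware extraction is preferred for reducing meaningless strings.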
S102: screening the character strings by importance degree according to the importance evaluation parameters to obtain a first preset number of high-importance character strings, and taking each high-importance character string as a malware identification feature;
On the basis of S101, this step screens out the character strings that meet the requirement according to the magnitude of the calculated importance evaluation parameters. With the calculation described above, the magnitude of the importance evaluation parameter is proportional to the importance degree, so it is sufficient to select from all character strings the high-importance character strings with the larger importance evaluation parameters, and to use them as malware identification features in the subsequent steps.
There are many ways to screen out the high-importance character strings. For example, the importance evaluation parameters of all character strings may be pooled and divided into tiers of different importance, and the character strings in the most important tier selected. Alternatively, a threshold may be set and the character strings whose importance evaluation parameters exceed it selected. A queue sorted in descending order of the importance evaluation parameters may also be built, so that only the character strings corresponding to the first N importance evaluation parameters need to be taken from the queue.
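Two of the screening options described above, the threshold and the first-N selection, can be sketched in Python as follows. The function names and the sample scores are assumptions for illustration only.

```python
import heapq


def select_by_threshold(scores: dict[str, float], threshold: float) -> list[str]:
    """Keep the strings whose importance evaluation parameter exceeds the threshold."""
    return [string for string, value in scores.items() if value > threshold]


def select_top_n(scores: dict[str, float], n: int) -> list[str]:
    """Keep the n strings with the largest importance evaluation parameters."""
    # nlargest returns the keys in descending order of score, i.e. the head
    # of the descending queue, without sorting the whole collection.
    return heapq.nlargest(n, scores, key=scores.get)


importance = {"a.dll": 0.1, "evil.example": 0.9, "CryptEncrypt": 0.5, "readme.txt": 0.3}
top_two = select_top_n(importance, 2)
above = select_by_threshold(importance, 0.25)
```

The first-N variant fixes the number of features in advance, whereas the threshold variant lets the number of features vary with the data.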
Furthermore, to avoid missing character strings that occur rarely but are strongly indicative, the malware can be divided by type, for example into ransomware, Trojan horses and worms, or even into finer-grained families. The TFIDF technology is then applied separately to the file set of each type, and the character strings with larger importance evaluation parameters within a single type are also retained as malware identification features, which reduces the omission probability and improves the identification effect.
One implementation, among others, is as follows:
merging the PE files contained in the software to be tested by category, according to the different categories of malware, to obtain a merged PE file set for each category;
calculating, by using the TFIDF technology, the importance degree of each character string with respect to each PE file within the merged PE file set of each category, to obtain merged importance parameters;
selecting, in descending order of the merged importance parameters, a second preset number of high-importance character strings, and using them as malware identification features; wherein the second preset number is less than the first preset number (because the file set of a finer category contains fewer files than the original, unclassified file set).
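The per-category steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the source does not say how the per-file scores of one string are reduced to a single value per category, so the maximum over the files of the category is used here as a hypothetical choice, and the name `top_strings_per_category` and the sample data are invented for the example.

```python
import heapq
import math
from collections import Counter, defaultdict


def top_strings_per_category(samples, second_n):
    """samples: iterable of (category, list_of_strings); returns category -> top strings."""
    # Step 1: merge the PE documents by malware category.
    by_category = defaultdict(list)
    for category, strings in samples:
        by_category[category].append(strings)

    selected = {}
    for category, docs in by_category.items():
        # Step 2: TF-IDF computed only within this category's file set.
        doc_freq = Counter()
        for doc in docs:
            doc_freq.update(set(doc))
        best = {}
        for doc in docs:
            counts = Counter(doc)
            total = sum(counts.values())
            for term, count in counts.items():
                score = (count / total) * math.log(len(docs) / doc_freq[term])
                # Assumption: keep the best per-file score of each string.
                best[term] = max(best.get(term, 0.0), score)
        # Step 3: keep the second preset number of strings, largest first.
        selected[category] = heapq.nlargest(second_n, best, key=best.get)
    return selected


samples = [
    ("ransomware", ["CryptEncrypt", "readme.txt", "kernel32.dll"]),
    ("ransomware", ["kernel32.dll", "pay.example"]),
    ("worm", ["kernel32.dll", "net share"]),
    ("worm", ["kernel32.dll", "smb scan", "smb scan"]),
]
top_one = top_strings_per_category(samples, 1)
top_two = top_strings_per_category(samples, 2)
```

Strings selected this way would be added to the features chosen over the whole, unclassified file set.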
After the low-importance character strings have been screened out, the relationship among the remaining high-importance character strings can be determined by grouping those of the same kind. For example, topic analysis can be used to merge high-importance character strings that share a topic, in the same way that words such as football, basketball and Mandarin can be grouped under the topic of sports. Each file is finally converted into a multi-dimensional vector that represents its distribution over the topics, which makes it easier to determine how the high-importance character strings are classified. Other algorithms, such as clustering with K-means, may of course be used to achieve the same effect.
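The conversion of a PE document into a topic distribution vector can be sketched as follows, assuming the topic analysis has already assigned each identification feature to exactly one topic. That hard assignment is a simplification made for this example (a topic model would normally give each string a weight per topic), and the mapping `feature_topic`, the topic numbering and the function name are hypothetical.

```python
from collections import Counter


def topic_distribution(doc_strings, feature_topic, n_topics):
    """Return the share of the document's identification features that falls in each topic."""
    # Strings that are not identification features are ignored.
    counts = Counter(feature_topic[s] for s in doc_strings if s in feature_topic)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_topics  # the document contains no identification features
    return [counts[topic] / total for topic in range(n_topics)]


# Hypothetical topics: 0 = cryptography, 1 = networking, 2 = registry access.
feature_topic = {
    "CryptEncrypt": 0,
    "CryptGenKey": 0,
    "WSAStartup": 1,
    "connect": 1,
    "RegSetValueExA": 2,
}
vector = topic_distribution(
    ["CryptEncrypt", "CryptGenKey", "connect", "kernel32.dll"], feature_topic, 3
)
```

Every PE file is thereby reduced to a vector of the same fixed length, whatever the number of strings it contains.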
S103: identifying malicious software of the software to be detected by utilizing the identification characteristics of the malicious software;
On the basis of S102, this step uses the obtained malware identification features to identify whether the software to be tested is malware. Specifically, this can be done with a classifier built on the malware identification features. The classifier may be built with a linear or non-linear classification algorithm, including but not limited to logistic regression, a support vector machine or a decision tree.
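As an illustration of one of the listed options, the sketch below trains a small logistic regression classifier by stochastic gradient descent on feature vectors. The training data, the learning rate, the epoch count and the function names are all assumptions for the example; a practical system would more likely rely on an established machine learning library.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias with per-sample gradient steps on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            z = sum(w * v for w, v in zip(weights, features)) + bias
            error = sigmoid(z) - label  # gradient of the log loss with respect to z
            weights = [w - lr * error * v for w, v in zip(weights, features)]
            bias -= lr * error
    return weights, bias


def is_malware(weights, bias, features) -> bool:
    z = sum(w * v for w, v in zip(weights, features)) + bias
    return sigmoid(z) >= 0.5


# Toy feature vectors (for example, two-topic distributions); label 1 = malware.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [1, 1, 0, 0]
weights, bias = train_logistic_regression(X, y)
```

After training on this toy data, a vector dominated by the first dimension is classified as malware and one dominated by the second as benign.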
With the above technical scheme, the character string-based malware identification method provided by this embodiment uses the TFIDF technology to compute the importance degree of each character string extracted from the PE files contained in the software to be tested, selects from these the high-importance character strings with high classification capability and discriminative power, and then uses the high-importance character strings as malware identification features to complete malware identification.
Example two
With reference to fig. 2, fig. 2 is a flowchart of extracting character strings from a PE file in a character string-based malware identification method according to an embodiment of the present disclosure.
Character strings can be extracted from the PE file in ways that include, but are not limited to, the following:
S201: dividing each PE file into a header metadata section, a code section and a data section according to the specific file format of the PE file;
According to the PE file format published by Microsoft, a PE file can be roughly divided into a header metadata section, a code section, and a data section. The data in the header metadata section follows a fairly standard format, and the information it contains includes descriptions of the program sections (section name, size, permissions, etc.). The code section mainly contains the functional implementation written in a programming language, and the data section mainly contains the visible content presented to the user (similar to the body of a letter).
Therefore, according to the content and characteristics of each part, useful character strings can be extracted from each part in a targeted manner, and meaningless character strings can be removed.
S202: extracting the content of the header metadata section by using a metadata parser to obtain a first character string set;
Because the data in the header metadata section follows a fairly standard format and is very important, all information in the header metadata section can be extracted by a dedicated parser. The extracted content is uniformly treated as character strings, regardless of whether it was originally in a special format such as a number.
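As an illustration of S202, the following is a minimal sketch of a header parser, not the parser used by the applicant. It assumes the publicly documented PE layout (the `MZ` signature, the `e_lfanew` field at offset 0x3C, the 20-byte COFF header and 40-byte section table entries) and reads only a small subset of the header. The function name `parse_pe_header_strings` and the `key=value` output format are hypothetical choices made for this example.

```python
import struct


def parse_pe_header_strings(data: bytes) -> list[str]:
    """Return a subset of PE header fields, each rendered as a string."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # e_lfanew, the file offset of the PE signature, is stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    # The 20-byte COFF file header directly follows the signature.
    (machine, num_sections, timestamp, _symtab, _nsyms,
     opt_size, characteristics) = struct.unpack_from(
        "<HHIIIHH", data, pe_offset + 4)
    # Numbers are deliberately turned into strings, as the text describes.
    strings = [
        f"machine={machine:#06x}",
        f"sections={num_sections}",
        f"timestamp={timestamp}",
        f"characteristics={characteristics:#06x}",
    ]
    # The section table follows the optional header; entries are 40 bytes.
    table = pe_offset + 4 + 20 + opt_size
    for i in range(num_sections):
        (name, vsize, _vaddr, raw_size, _raw_ptr, _reloc_ptr, _line_ptr,
         _nreloc, _nline, flags) = struct.unpack_from(
            "<8sIIIIIIHHI", data, table + 40 * i)
        section = name.rstrip(b"\0").decode("ascii", errors="replace")
        strings.append(
            f"section={section}:vsize={vsize}:rawsize={raw_size}"
            f":flags={flags:#010x}")
    return strings
```

A production parser would also read the optional header and the data directories; this sketch stops at the section table to stay short.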
S203: decompiling the content of the code section, and extracting character strings containing preset function names from the decompiled data to obtain a second character string set;
S204: extracting character strings containing at least one of an IP address, a mailbox address and a website address from the content of the data section to obtain a third character string set;
In contrast to the header metadata section, the code section and the data section contain a large number of meaningless character strings. Extracting all of them would only interfere with the subsequent steps and greatly increase the computation time, so only character strings in specified areas or of specified types need to be extracted from these two parts.
For the code section, only character strings from special areas, such as function names, are extracted after decompiling. For the data section, the key content to extract is mainly character strings in special formats, such as IP addresses, mailbox addresses and website addresses.
One way to extract this data is regular expression matching. Other identical or similar methods can of course be used for the same purpose, and they are not described here again.
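The regular expression approach mentioned for the data section can be sketched as follows. This is an illustration under assumptions: the three patterns are simplified (IPv4 only, a loose mailbox pattern, `http`/`https` only) and the names `IPV4_RE`, `EMAIL_RE`, `URL_RE` and `extract_indicator_strings` are hypothetical; the source does not give the actual expressions.

```python
import re

# Simplified patterns over raw bytes; real deployments would likely be stricter.
IPV4_RE = re.compile(
    rb"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b")
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(rb"https?://[\x21-\x7e]+")


def extract_indicator_strings(data: bytes) -> set[str]:
    """Collect IP addresses, mailbox addresses and URLs found in a data section."""
    found = set()
    for pattern in (IPV4_RE, EMAIL_RE, URL_RE):
        for match in pattern.finditer(data):
            found.add(match.group().decode("ascii", errors="replace"))
    return found
```

Returning a set means each character string is kept once per PE file, which matches the later statement that a character string appears only once in a PE document.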
S205: generating, for each PE file, a corresponding PE document from the first character string set, the second character string set and the third character string set extracted from that PE file, according to a preset combination format, so as to reduce the number of invalid character strings.
Based on the above extraction steps, this step generates a PE document for each PE file from its first, second, and third character string sets according to a preset combination format. The PE documents then replace the original PE files in the subsequent TFIDF calculation, which can greatly reduce the number of meaningless character strings. Since a given character string appears only once in a PE document, when the TFIDF technique is used in this manner the corresponding value can only be 1 or 0, depending on whether the character string appears in the software to be tested.
Further, after character strings have been extracted from each part, the number of PE files containing each specific character string needs to be counted, and the final statistical result is returned in the form of a map. A map stores a set of one-to-one mappings in which each key corresponds to exactly one value; here, the key is a specific character string and the value is the number of PE files containing that key.
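The map described above can be sketched with a standard counter. This is a minimal illustration that assumes each PE document is already available as a set of character strings keyed by a file name; the function name `count_files_containing` and the input shape are hypothetical.

```python
from collections import Counter


def count_files_containing(pe_documents: dict[str, set[str]]) -> Counter:
    """Map each character string to the number of PE documents containing it."""
    counts = Counter()
    for strings in pe_documents.values():
        # set() guards against duplicates if a caller passes a list instead.
        counts.update(set(strings))
    return counts
```

Looking up a character string that never occurred returns 0 rather than raising an error, which is convenient when scoring strings from a new file.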
On the basis of the first embodiment, this embodiment provides a method for extracting meaningful character strings from a PE file in a targeted manner. Because redundant and useless character strings are removed, the workload of the subsequent processing steps is effectively reduced.
Example three
With reference to fig. 3, fig. 3 is a flowchart of another character string-based malware identification method according to an embodiment of the present disclosure.
This embodiment provides a specific implementation of the first embodiment and includes the following steps:
S301: calculating, by using the formula
Figure BDA0001701976340000121
the importance degree of each character string extracted from the PE files contained in the software to be tested to each PE file, to obtain an importance evaluation parameter;
Here, TF is the number of occurrences of a specific character string in a specific PE file, IDF is the number of PE files in the software to be tested that contain that character string, and N is the total number of PE files contained in the software to be tested.
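The formula itself is only available as an image in the source and is not reproduced here. As an assumption, one reading that is consistent with the three symbol definitions above and with the base-10 worked example in this embodiment is the standard TF-IDF product, in which the symbol IDF denotes a file count rather than the logarithm:

```latex
\mathrm{TFIDF} = \mathrm{TF} \times \lg\frac{N}{\mathrm{IDF}}
```

The published formula may differ in detail, for example by a smoothing term in the denominator, so this should be read as a sketch rather than as the patented expression.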
For example, if a document contains 100 words in total and the term "atomic energy" occurs 3 times, the TF (term frequency) of "atomic energy" in that document is 3/100 = 0.03. One way to calculate the IDF is to divide the total number of documents in the document set by the number of documents in which the term "atomic energy" occurs, and then take the logarithm. Thus, if "atomic energy" occurs in 1000 documents and the total number of documents is 10000000, the IDF is lg(10000000/1000) = 4, and the TFIDF score of "atomic energy" is 0.03 × 4 = 0.12.
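The arithmetic of this example can be checked with a few lines of code. The sketch follows the example's normalized term frequency (occurrences divided by document length) and its base-10 logarithm; the function name `tfidf` and its parameter names are hypothetical.

```python
import math


def tfidf(term_count: int, doc_length: int,
          docs_with_term: int, total_docs: int) -> float:
    """TF-IDF score using a length-normalized TF and a base-10 logarithm."""
    tf = term_count / doc_length
    idf = math.log10(total_docs / docs_with_term)
    return tf * idf


# The "atomic energy" example: 3 occurrences in 100 words,
# present in 1000 of 10000000 documents.
score = tfidf(3, 100, 1000, 10_000_000)
```

A character string that occurs in every document scores 0 because lg(1) = 0, which is why strings common to all PE files are screened out as low-importance.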
S302: arranging the importance evaluation parameters in descending order from top to bottom to obtain an importance evaluation parameter queue;
S303: selecting a first preset number of importance evaluation parameters from the top of the importance evaluation parameter queue downward, taking the corresponding character strings as high-importance character strings, and taking each high-importance character string as a malware identification feature;
In this embodiment, a queue is built so that the high-importance character strings can be screened out by importance degree and used as malware identification features. The first preset number can be set flexibly according to the actual situation, for example to 3 or 5, so that the character strings ranked in the top 3 or top 5 by importance degree are selected.
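Steps S302 and S303 amount to a sort followed by a slice. The sketch below assumes the importance evaluation parameters are held in a dictionary from character string to score; the function name `select_high_importance` and the sample strings are hypothetical.

```python
def select_high_importance(scores: dict[str, float],
                           first_preset_number: int) -> list[str]:
    """Return the strings with the highest scores, best first."""
    # S302: arrange the importance evaluation parameters in descending order.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # S303: keep the strings behind the first preset number of parameters.
    return [string for string, _ in ranked[:first_preset_number]]
```

Python's sort is stable, so character strings with equal scores keep their original insertion order; a tie-breaking rule would need to be chosen explicitly if that matters.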
S304: classifying the malware identification features by using a topic analysis technique to obtain topic-classified identification features;
S305: converting each PE file into a topic distribution vector according to the topic-classified identification features it contains;
In this embodiment, once the malware identification features have been selected, the topic analysis technique is further used to classify each malware identification feature, and a topic distribution vector is finally obtained, so that similar malware identification features are grouped together.
Simply put, given a large number of documents, topic analysis can automatically group the words contained in the documents by topic without any manual input; given a parameter K, it automatically extracts K topics and the words related to each topic. For example, the words related to the topic "sports" may include "football", "basketball", "Mangan" and "NBA". On the basis of this result, topic analysis can convert a document into a K-dimensional vector representing the distribution of topics in it; for example, "document No. 1" may be 30% about sports, 40% about entertainment, and 30% about Europe.
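The conversion of a document into a K-dimensional topic distribution vector can be illustrated with a deliberately simplified stand-in. The sketch assumes that a topic model has already assigned each feature to exactly one topic (the hypothetical `topic_of` mapping); a real topic model would normally produce soft assignments and infer the mapping itself, which is not shown here.

```python
def topic_distribution(features: list[str], topic_of: dict[str, int],
                       k: int) -> list[float]:
    """Share of a document's features that falls under each of k topics."""
    counts = [0] * k
    for feature in features:
        topic = topic_of.get(feature)
        if topic is not None:  # features without a topic are ignored
            counts[topic] += 1
    total = sum(counts)
    return [count / total for count in counts] if total else [0.0] * k


# Topic indices: 0 = sports, 1 = entertainment, 2 = Europe (illustrative).
topic_of = {"football": 0, "basketball": 0, "NBA": 0,
            "movie": 1, "concert": 1, "Paris": 2}
document_1 = ["football", "basketball", "NBA", "movie", "concert",
              "concert", "movie", "Paris", "Paris", "Paris"]
vector = topic_distribution(document_1, topic_of, 3)
```

With these inputs the vector reproduces the "document No. 1" split of 30% sports, 40% entertainment and 30% Europe.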
S306: each topic distribution vector is classified using a linear classification algorithm.
After the extraction, screening and classification of the character strings are completed, each PE file is converted into a feature vector, so that the PE files can be classified directly by using a conventional linear classification algorithm, including but not limited to logistic regression, support vector machine, decision tree, and the like.
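As one example of the classifiers listed above, the following sketch trains a logistic regression model on topic distribution vectors with plain stochastic gradient descent. It is an illustration only: the function names, the learning rate, the epoch count and the toy training vectors are all assumptions, and a practical system would more likely rely on an established machine learning library.

```python
import math


def train_logistic_regression(vectors, labels, epochs=500, learning_rate=0.5):
    """Fit weights and a bias so that sigmoid(w.x + b) approximates the label."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            predicted = 1.0 / (1.0 + math.exp(-z))
            error = predicted - y  # gradient of the log loss with respect to z
            weights = [w - learning_rate * error * xi
                       for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (malicious) or 0 (benign) for one topic distribution vector."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if z >= 0.0 else 0


# Toy data: label 1 marks vectors dominated by the first topic.
train_x = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.2, 0.7], [0.2, 0.1, 0.7]]
train_y = [1, 1, 0, 0]
weights, bias = train_logistic_regression(train_x, train_y)
```

The decision boundary is the hyperplane w.x + b = 0, which is what makes this a linear classifier in the sense of S306.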
Based on the above technical solution, the character string-based malware identification method provided by this embodiment extracts information from each part of a PE file in a targeted manner based on the specific format of the PE file, which greatly reduces the number of meaningless character strings. It then uses the TFIDF technique to score the importance of each extracted character string, selects the high-importance character strings with strong classification capability and distinctiveness, groups high-importance character strings of the same category by means of the topic analysis technique, and finally completes malware identification based on the feature vectors obtained through conversion.
It should be noted that, in this embodiment, S301 is one specific implementation of S101, without limitation; S302 and S303 are one specific implementation of S102, without limitation; and S306 is an implementation of S103 that uses a linear classifier to classify the topic distribution vectors obtained through the conversion in S304 and S305. Each of these three parts may, on its own, be combined with the first embodiment, which corresponds to the independent claim, to form a specific embodiment. They may also be flexibly combined, according to the particular requirements of an actual situation, with the further implementations given in the first embodiment and with the scheme for extracting meaningful character strings from PE files given in the second embodiment, to obtain different specific embodiments. This embodiment is merely a preferred embodiment that adopts all three specific implementations at once, in execution order.
Because the possible situations are varied and cannot all be listed, those skilled in the art will appreciate that many examples can be derived from the basic method principle provided by this application in combination with actual situations. Such examples, obtained without sufficient inventive effort, shall fall within the protection scope of this application.
Referring to fig. 4, fig. 4 is a block diagram illustrating a structure of a malware recognition system based on character strings according to an embodiment of the present disclosure.
The malware identification system may include:
a TFIDF calculating unit 100, configured to calculate, by using a TFIDF technique, importance degrees of each character string extracted from PE files included in software to be tested to each PE file, so as to obtain an importance evaluation parameter;
the malware identification feature selecting unit 200 is configured to screen the corresponding character strings by importance degree according to the importance evaluation parameters to obtain a first preset number of high-importance character strings, and to use each high-importance character string as a malware identification feature; wherein the magnitude of the importance evaluation parameter is proportional to the importance degree of the character string;
the malware identification unit 300 is configured to perform malware identification on the software to be tested by using each malware identification feature.
TFIDF calculating unit 100 includes:
a formula calculation subunit, configured to calculate, by using the formula
Figure BDA0001701976340000141
the importance degree of each character string extracted from the PE files contained in the software to be tested to each PE file, to obtain an importance evaluation parameter; wherein TF is the number of occurrences of the character string in each PE file, IDF is the number of PE files in the software to be tested that contain the character string, and N is the total number of PE files contained in the software to be tested.
The system further includes:
the sequence arrangement unit is used for arranging the importance evaluation parameters from top to bottom in a descending order to obtain an importance evaluation parameter arrangement queue;
the first identification feature selection subunit is used for selecting a first preset number of importance evaluation parameters from the importance evaluation parameter arrangement queue from top to bottom, and taking the corresponding character strings as high-importance character strings.
The system further includes:
the same-category merging subunit is used for merging the PE files contained in the software to be tested by category, according to the different categories of malware, to obtain a merged PE file set for each category;
the merged parameter calculation subunit is used for calculating, by using the TFIDF technique, the importance degree of each character string in the merged PE file set of each category to each PE file, to obtain merged importance parameters;
the second identification feature selection subunit is used for selecting, in descending order of the merged importance parameters, a second preset number of high-importance character strings, and also using them as malware identification features; wherein the second preset number is less than the first preset number.
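The category-merged selection performed by these three subunits could be sketched as follows. This is one possible reading, stated as an assumption: all PE documents of a malware category are merged into a single document, each character string is scored against the merged documents with TF fixed at 1, and the top-scoring strings are kept per category. The function name, the input shape and the alphabetical tie-break are hypothetical.

```python
import math


def select_per_category(docs_by_category: dict[str, list[set[str]]],
                        second_preset_number: int) -> dict[str, list[str]]:
    """Pick the most category-specific strings for each malware category."""
    # Merge all PE documents of one category into a single merged document.
    merged = {category: set().union(*docs)
              for category, docs in docs_by_category.items()}
    total = len(merged)
    selected = {}
    for category, strings in merged.items():
        scores = {}
        for string in strings:
            containing = sum(1 for other in merged.values() if string in other)
            # TF is 1 for every string present, so the score is lg(N / count).
            scores[string] = math.log10(total / containing)
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda s: (-scores[s], s))
        selected[category] = ranked[:second_preset_number]
    return selected
```

Under this reading, a character string shared by every category scores 0 and is selected only when a category has too few distinctive strings to fill the second preset number.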
The system further includes:
the PE file content splitting unit is used for splitting each PE file into a header metadata section, a code section and a data section according to a specific file format of the PE file;
the head content extraction unit is used for extracting the content in the head metadata section by using a metadata parser to obtain a first character string set;
the code segment content extraction unit is used for decompiling the content in the code segment, and extracting a character string containing a preset function name from decompiled data to obtain a second character string set;
the data segment content extraction unit is used for extracting character strings including at least one of an IP address, a mailbox address and a website from the content contained in the data segment to obtain a third character string set;
and the PE document generating unit is used for generating, for each PE file, a corresponding PE document from the first character string set, the second character string set and the third character string set extracted from that PE file, according to a preset combination format, so as to reduce the number of invalid character strings.
The system further includes:
the topic analysis and classification unit is used for classifying the malware identification features by using a topic analysis technique to obtain topic-classified identification features;
and the topic distribution vector generating unit is used for converting each PE file into a topic distribution vector according to the topic-classified identification features it contains.
The malware identification unit 300 includes:
and the linear classification algorithm classification subunit is used for classifying each topic distribution vector by using a linear classification algorithm.
Based on the foregoing embodiment, the present application further provides a device for identifying malware based on a character string, where the device for identifying malware may include a memory and a processor, where the memory stores a computer program, and when the processor calls the computer program in the memory, the steps provided in the foregoing embodiment may be implemented. Of course, the malware recognition device may also include various necessary network interfaces, power supplies, other components, and the like.
The present application also provides a computer-readable storage medium on which a computer program is stored; when executed by an execution terminal or a processor, the computer program can implement the steps provided by the above embodiments. The storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief; for relevant details, refer to the description of the method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that, for those skilled in the art, it is possible to make several improvements and modifications to the present application without departing from the principle of the present application, and such improvements and modifications also fall within the scope of the claims of the present application.
It is further noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (16)

1. A method for identifying malicious software based on character strings is characterized by comprising the following steps:
calculating the importance degree of each character string extracted from the PE file contained in the software to be tested on each PE file by using a TFIDF technology to obtain an importance evaluation parameter; the character strings extracted from each PE file comprise a first character string set, a second character string set and a third character string set, wherein the first character string set, the second character string set and the third character string set are obtained by respectively extracting a head metadata section, a code section and a data section of the PE file, the second character string set comprises character strings with preset function names, and the third character string set comprises character strings with at least one of IP addresses, mailbox addresses and web addresses;
screening corresponding character strings according to the importance evaluation parameters according to importance degrees to obtain a first preset number of high-importance character strings, and taking each high-importance character string as a malicious software identification feature; wherein the size of the importance evaluation parameter is proportional to the importance degree of the character string;
and identifying the malicious software of the software to be tested by utilizing the identification characteristics of the malicious software.
2. The method of claim 1, wherein the calculating the importance degree of each character string extracted from the PE files included in the software to be tested to each PE file by using TFIDF technique to obtain the importance evaluation parameter includes:
by using
Figure DEST_PATH_IMAGE002
Calculating the importance degree of each character string extracted from the PE file contained in the software to be tested on each PE file to obtain an importance evaluation parameter;
the TF is the occurrence frequency of the character string in each PE file, the IDF is the number of all PE files containing the character string in the software to be tested, and the N is the total number of the PE files contained in the software to be tested.
3. The method of claim 1, wherein the step of screening the corresponding character strings according to the importance degree by using the importance evaluation parameter to obtain a first preset number of high-importance character strings comprises:
arranging all the importance evaluation parameters from top to bottom according to a descending order to obtain an importance evaluation parameter arrangement queue;
and selecting the importance evaluation parameters of the first preset number from the importance evaluation parameter arrangement queue from top to bottom, and taking the corresponding character strings as the high-importance character strings.
4. The method of claim 1, further comprising:
merging the PE files contained in the software to be tested in the same type according to different types contained in the malicious software to obtain a PE file set after merging of all types;
respectively calculating the importance degree of each character string to each PE file in the PE file set after the character strings are merged in each category by utilizing the TFIDF technology to obtain the merged importance parameters;
correspondingly selecting a second preset number of high-importance character strings according to the merged importance parameters from large to small, and using the high-importance character strings as the identification characteristics of the malicious software; wherein the second preset number is less than the first preset number.
5. The method according to any one of claims 1 to 4, before calculating the importance degree of each character string extracted from the PE file included in the software under test to each PE file by using the TFIDF technology, further comprising:
dividing each of the PE files into the header metadata section, the code section, and the data section according to a specific file format constituting the PE file;
extracting the content in the head metadata section by using a metadata parser to obtain the first character string set;
decompiling the content in the code segment, and extracting a character string containing the preset function name from decompiled data to obtain a second character string set;
extracting character strings including at least one of the IP address, the mailbox address and the website from the content contained in the data segment to obtain a third character string set;
and generating corresponding PE documents by the first character string set, the second character string set and the third character string set extracted from each PE file according to a preset combination format so as to reduce the number of invalid character strings.
6. The method of claim 1, further comprising, prior to identifying malware to the software under test using each of the malware-identifying characteristics:
classifying the malware identification features by using a topic analysis technique to obtain topic-classified identification features;
and converting each PE file into a topic distribution vector according to the topic-classified identification features it contains.
7. The method of claim 6, wherein identifying malware to the software under test using each of the malware identification features comprises:
and classifying each topic distribution vector by utilizing a linear classification algorithm.
8. A string-based malware identification system, comprising:
the TFIDF calculation unit is used for calculating the importance degree of each character string extracted from the PE file contained in the software to be tested on each PE file by utilizing a TFIDF technology to obtain an importance evaluation parameter; the character strings extracted from each PE file comprise a first character string set, a second character string set and a third character string set, wherein the first character string set, the second character string set and the third character string set are obtained by respectively extracting a head metadata section, a code section and a data section of the PE file, the second character string set comprises character strings with preset function names, and the third character string set comprises character strings with at least one of IP addresses, mailbox addresses and web addresses;
the malware identification feature selection unit is used for screening the corresponding character strings according to the importance degree according to the importance evaluation parameters to obtain a first preset number of high-importance character strings, and taking each high-importance character string as a malware identification feature; wherein the size of the importance evaluation parameter is proportional to the importance degree of the character string;
and the malicious software identification unit is used for identifying the malicious software of the software to be tested by utilizing the malicious software identification characteristics.
9. The system of claim 8, wherein the TFIDF calculation unit comprises:
a formula calculation subunit, configured to calculate, by using the formula
Figure 305615DEST_PATH_IMAGE002
the importance degree of each character string extracted from the PE file contained in the software to be tested on each PE file, to obtain an importance evaluation parameter; wherein the TF is the occurrence frequency of the character string in each PE file, the IDF is the number of all PE files containing the character string in the software to be tested, and the N is the total number of the PE files contained in the software to be tested.
10. The system of claim 8, further comprising:
the sequence arrangement unit is used for arranging all the importance evaluation parameters from top to bottom according to a descending order to obtain an importance evaluation parameter arrangement queue;
and the first identification feature selection subunit is used for selecting the importance evaluation parameters of the first preset number from the importance evaluation parameter arrangement queue from top to bottom, and taking the corresponding character strings as the high-importance character strings.
11. The system of claim 8, further comprising:
the similar merging subunit is used for performing similar merging on the PE files contained in the software to be tested according to different classes contained in the malicious software to obtain a PE file set after merging of all classes;
the merged parameter calculation subunit is used for respectively calculating the importance degree of each character string in the merged PE file set of each category to each PE file by utilizing the TFIDF technology to obtain a merged importance parameter;
a second identification feature selection subunit, configured to correspondingly select, from a large order to a small order, a second preset number of high-importance character strings from each of the merged importance parameters, and use the selected high-importance character strings as the identification features of the malware; wherein the second preset number is less than the first preset number.
12. The system of any one of claims 8 to 11, further comprising:
a PE file content splitting unit configured to divide each PE file into the header metadata segment, the code segment, and the data segment according to a specific file format constituting the PE file;
a header content extraction unit, configured to extract content in the header metadata segment by using a metadata parser, so as to obtain the first character string set;
a code segment content extraction unit, configured to decompile the content in the code segment, and extract a character string including the preset function name from decompiled data to obtain the second character string set;
a data segment content extracting unit, configured to extract a character string including at least one of the IP address, the mailbox address, and the website from content included in the data segment, so as to obtain the third character string set;
and the PE document generating unit is used for generating corresponding PE documents by the first character string set, the second character string set and the third character string set extracted from each PE file according to a preset combination format so as to reduce the number of invalid character strings.
13. The system of claim 8, further comprising:
the topic analysis and classification unit is used for classifying the malware identification features by using a topic analysis technique to obtain topic-classified identification features;
and the topic distribution vector generating unit is used for converting each PE file into a topic distribution vector according to the topic-classified identification features it contains.
14. The system of claim 13, wherein the malware identification unit comprises:
and the linear classification algorithm classification subunit is used for classifying each topic distribution vector by using a linear classification algorithm.
15. A character string-based malware recognition apparatus, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the malware identification method of any one of claims 1 to 7 when executing the computer program.
16. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the malware identification method of any one of claims 1 to 7.
CN201810639502.7A 2018-06-20 2018-06-20 Character string-based malicious software identification method, system and related device Active CN110619212B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810639502.7A CN110619212B (en) 2018-06-20 2018-06-20 Character string-based malicious software identification method, system and related device
PCT/CN2019/087563 WO2019242443A1 (en) 2018-06-20 2019-05-20 Character string-based malware recognition method and system, and related devices

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810639502.7A CN110619212B (en) 2018-06-20 2018-06-20 Character string-based malicious software identification method, system and related device

Publications (2)

Publication Number Publication Date
CN110619212A (en) 2019-12-27
CN110619212B (en) 2022-01-18

Family

ID=68921020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810639502.7A Active CN110619212B (en) 2018-06-20 2018-06-20 Character string-based malicious software identification method, system and related device

Country Status (2)

Country Link
CN (1) CN110619212B (en)
WO (1) WO2019242443A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115189922B (en) * 2022-06-17 2024-04-09 阿里云计算有限公司 Risk identification method and apparatus, and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473506A (en) * 2013-08-30 2013-12-25 北京奇虎科技有限公司 Method and device of recognizing malicious APK files
CN105740707A (en) * 2016-01-20 2016-07-06 北京京东尚科信息技术有限公司 Malicious file identification method and device
CN105956469A (en) * 2016-04-27 2016-09-21 百度在线网络技术(北京)有限公司 Method and device for identifying file security
CN107315955A (en) * 2016-04-27 2017-11-03 百度在线网络技术(北京)有限公司 File security recognition methods and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10621232B2 (en) * 2015-03-11 2020-04-14 Sap Se Importing data to a semantic graph

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
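Microblog topics based on an improved TF-IDF algorithm; Chen Shuoying et al.; Science & Technology Review (《科技导报》); 2016-01-28; Vol. 34, No. 2; pp. 282-284, Section 3.5 *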

Also Published As

Publication number Publication date
CN110619212A (en) 2019-12-27
WO2019242443A1 (en) 2019-12-26

Similar Documents

Publication Publication Date Title
US10445357B2 (en) Document classification system, document classification method, and document classification program
CN111767716A (en) Method and device for determining enterprise multilevel industry information and computer equipment
CN107729520B (en) File classification method and device, computer equipment and computer readable medium
CN105630975B (en) Information processing method and electronic equipment
WO2022121163A1 (en) User behavior tendency identification method, apparatus, and device, and storage medium
CN109165529B (en) Dark chain tampering detection method and device and computer readable storage medium
CN110941959A (en) Text violation detection method, text restoration method, data processing method and data processing equipment
JP4426894B2 (en) Document search method, document search program, and document search apparatus for executing the same
TW201415402A (en) Forensic system, forensic method, and forensic program
JP2010061176A (en) Text mining device, text mining method, and text mining program
WO2014057965A1 (en) Forensic system, forensic method, and forensic program
CN107908649B (en) Text classification control method
TW201508525A (en) Document sorting system, document sorting method, and document sorting program
CN110619212B (en) Character string-based malicious software identification method, system and related device
CN109508557A (en) A kind of file path keyword recognition method of association user privacy
JP2016218512A (en) Information processing device and information processing program
CN117171331A (en) Professional field information interaction method, device and equipment based on large language model
JP2006251975A (en) Text sorting method and program by the method, and text sorter
JP4525433B2 (en) Document aggregation device and program
CN113468339A (en) Label extraction method, system, electronic device and medium based on knowledge graph
Hirsch et al. Evolving rules for document classification
CN112380342A (en) Electric power document theme extraction method and device
CN111079448A (en) Intention identification method and device
JP4690232B2 (en) Information processing apparatus, software registration method, and program
CN113032549B (en) Document sorting method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant